I was at a conference recently and talked with a science professor at another university. He made the following startling statement.
He has close to 1 petabyte (PB) of data that he uses in his research. In addition, he surveyed other scientific research groups at his university and found 19 other groups, each with more than 100 terabytes (TB) of data. In other words, 20 research groups at his university have data sets between 100 TB and 1 PB in size.
I immediately said, "Why not ask your university's IT services to stand up a 20-petabyte cluster?"
His reply: "Nobody thinks they are ready to do this. This is research computing, very different from regular IT. The tradeoffs for research computing are quite different from corporate IT."
I then asked, "Why not put your data up on EC2?" [EC2 is Amazon's Elastic Compute Cloud service.]
His answer: "EC2 storage is too expensive for my research budget; you essentially have to buy your storage every month. Besides, how would I move a PB to Amazon? Sneaker net [disks sent to Amazon via U.S. mail] is not very appealing."
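His transfer-time concern is easy to quantify. A minimal back-of-envelope sketch, assuming a few illustrative sustained link speeds (the bandwidth figures are assumptions for illustration, not measurements of any real campus network):

```python
# Back-of-envelope: how long does it take to move 1 PB over a network link?
# Link speeds below are illustrative assumptions, not measurements.

PETABYTE_BITS = 8 * 10**15  # 1 PB = 10^15 bytes = 8 * 10^15 bits

for gbps in (1, 10, 100):
    seconds = PETABYTE_BITS / (gbps * 10**9)
    days = seconds / 86_400
    print(f"{gbps:>3} Gbps sustained -> {days:6.1f} days")
```

At a sustained 1 Gbps this works out to roughly 93 days; even a dedicated 10 Gbps link takes more than a week. Seen that way, shipping disks through the mail is less a joke than a rational fallback.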
As a result, he is in the process of starting a 20-research group federation that will stand up the required server. In other words, this consortium will run its own massive data server.
I am reminded of a talk given a couple of years ago by James Hamilton, then at Amazon. He claimed there are unbelievable economies of scale in running grid-oriented data centers (i.e., if you run 100,000 nodes, your per-node costs are a small fraction of those of a 1,000-node data center). Many of these cost savings come from unexpected places. For example, designing a physical data center (raised flooring, uninterruptible power supply, etc.) is something the small guy does once and the big guy has down to a science. Also, personnel costs rise much more slowly than the number of nodes.
I assume at least 20 universities have the same characteristics as the one noted above. I also assume that these 20 x 20 = 400 research groups get their funding from a small number of government agencies. It would make unbelievably good sense to have a single 400-PB system that all of the researchers share.
In effect, this blog post is a "call to arms." Agencies of the U.S. government are spending boatloads of money on pushing the envelope of massive compute servers. However, they appear to be ignoring the fact that many research groups have serious data-management problems.
Why not invest a small fraction of the "massive computing" budget on "massive data management"? Start by standing up a 400-PB data server run by somebody who understands big data. Several organizations with the required expertise come readily to mind. This would be a much better solution than a whole bunch of smaller systems run by consortiums of individual science groups.
There must be a better way. After all, the problem is only going to get worse.
Interesting ... and perhaps quite timely, given the recent NSF announcement about its new data management policy:
Question #5 - and its answer - in the DMP FAQ suggests there may be some support, at least for U.S. research funded by NSF:
"5. Should the budget and its justification specifically address the costs of implementing the Data Management Plan?
[Y]es. As long as the costs are allowable in accordance with the applicable cost principles, and necessary to implement the Data Management Plan, such costs may be included (typically on Line G2) of the proposal budget, and justified in the budget justification."
As an alternative, I wonder if perhaps Amazon, Microsoft or some other large-scale provider of large-scale data management services might be inclined to donate or offer steep educational discounts to universities for some level of service.