It is interesting to note that a substantial subset of the computer science community has redefined their research agenda to fit under the marketing banner of "Big Data." As such, it is clearly the "buzzword du jour." As somebody who has been working on database problems for a very long time (which, by definition, deal with big data), I would like to use a sequence of four blog posts to explain what I think "big data" means, and discuss what I see as the research agenda.
In the community I travel in, big data can mean one of four things:
Big volumes of data, but "small analytics." Here the idea is to support SQL on very large data sets. Nobody runs "Select*" from something big as this would overwhelm the recipient with terabytes of data. Instead, the focus is on running SQL analytics (count, sum, max, min, and avg with an optional group_by) on large amounts of data. I term this "small analytics" to distinguish this use case from the one which follows.
Big analytics on big volumes of data. By big analytics, I mean data clustering, regressions, machine learning, and other much more complex analytics on very large amounts of data. At the present time users tend to run big analytics using statistical packages, such as R, SPSS and SAS. Alternately, they use linear algebra packages such as ScalaPack or Arpack. Lastly, there is a fair amount of custom code (roll your own) used here.
Big velocity. By this I mean being able to absorb and process a fire hose of incoming data for applications like electronic trading, real-time ad placement on Web pages, real-time customer targeting, and mobile social networking. This use case is most prevalent in large Web properties and on Wall Street, both of whom tend to roll their own.
Big variety. Many enterprises are faced with integrating a larger and larger number of data sources with diverse data (spreadsheets, Web sources, XML, traditional DBMSs). Many enterprises view this as their number one headache. Historically, the extract, transform, and load (ETL) vendors serviced this market on modest numbers of data sources.
In summary, big data can mean big volume, big velocity, or big variety. In the remainder of this post, I talk about small analytics on big volumes of data. In three subsequent posts, I will discuss the other three problem areas.
Big Volume, Small Analytics
I am aware of more than five multi-petabyte data warehouses in production use running on three different commercial products. No doubt there are a couple of dozen more. All are running on "shared nothing" server farms with north of 100 usually "beefy" nodes, survive hardware node failures through failover to a backup replica, and perform a workload consisting of SQL analytics as defined above. All report operational challenges in keeping a large configuration running, and would like new DBMS features. Number one on everybody’s list is resource elasticity (i.e., add 50 more servers to a system of 100 servers, automatically repartitioning the data to include the extra servers, all without taking down time and without interrupting query processing). In addition, better resource management is also a common request. Here, multiple cost centers are sharing a common resource, and everybody wants to get their fair share. The pundits--for example, Curt Monash--often identify some of these data warehouses.
A second solution to this use case appears to be Hive/Hadoop. I know of a couple of multi-petabyte repositories using this technology, most notably Facebook. Again, there are probably a couple of dozen more, and I know of many IT shops who are prototyping this solution. There have been quite a few papers in the recent literature documenting the inefficiency of Hadoop, compared to parallel DBMSs. In general, you should expect at least an order of magnitude performance difference. This will translate into an order of magnitude worse response time on the same amount of hardware or an order of magnitude more hardware to achieve the same performance. If the later course is chosen, this is a decision to buy a lot of iron and use a lot of power. As detailed in my previous blog post with Jeremy Kepner, I am not a big fan of this solution.
In addition, Google and other large Web properties appear to be running large configurations with this sort of workload on home-brew software. Some of it looks much like commercial RDBMSs (e.g., F1) and some of it looks quite different (e.g., BigTable).
Off into the future, I see the main challenge in this world to be 100% uptime (i.e. never go down, no matter what). Of course, this is a challenging "ops" problem. In addition, this will require the installation of new hardware, the installation of patches, and the next iteration of a vendor’s software, without ever taking down time. Harder still is schema migration without incurring downtime.
In addition, I predict that the SQL vendors will all move to column stores, because they are wildly faster than row stores. In effect, all row store vendors will have to transition their products to column stores over time to be competitive. This will likely be a migration challenge to some of the legacy vendors.
Lastly, there is a major opportunity in this space for advanced storage ideas, including compression and encryption. Sampling to cut down query costs is also of interest.
In addition to being an adjunct professor at the Massachusetts Institute of Technology, Michael Stonebraker is associated with four startups that are either producers or consumers of database technology.