What Does ‘Big Data’ Mean? (Part 3)

MIT Adjunct Professor Michael Stonebraker

In two previous blog postings, I talked about four possible meanings of "big data", namely

and discussed the first two use cases. In this posting, I continue with a discussion of the third use case. "Big velocity" means "drinking from a firehose," i.e. coping with data arriving at very high speed. Examples abound and include Wall Street market feeds, maintaining the state of massive multi-player games, web logs, ad placement on web pages, and the data collection systems for sensor data, such as traffic congestion, automobile insurance logs and the like.

Immediately, we can divide the high velocity space into two components. The first one requires real-time attention to the feed, while the second one allows one to collect large groups of messages for batch processing. The batch processing case can be handled by a lot of technologies, so this posting deals only with the case where one needs to take action in real time. For example, one must decide what ad to place on a web page in milliseconds; taking electronic trading actions on Wall Street feeds must be done equally quickly. In general, the whole world is moving toward a "real-time" orientation, so I expect that shops willing to do batch processing will decrease over time. Therefore, in the rest of this posting, I assume we must have sub-second response time to high volumes of real-time messages.

Seemingly, there are two cases of interest, which we illustrate with Wall Street examples. The first one is to find patterns that should trigger electronic trades. Although the rules that govern the actions of trading systems are considered "secret sauce," the simple case can be explained by an example. Suppose stocks X and Y are thought to be correlated. Hence, if either one rises (falls), then one should buy (sell short) the other one. In general, one looks for more complex patterns of this sort; for example, a rise in Stock X followed within 10 msec. by a fall in Stock Y. Hence, trading application are processing a fire hose of trades looking for a collection of patterns. When one is found, the appropriate action is taken.

The second example concerns an electronic trading company with trading engines at various locations around the world (e.g. Tokyo, New York, London, Frankfurt, etc.). Each trading system is independent of the others, so headquarters wants to maintain the real-time global position of the entire firm, i.e. the aggregate number of shares long or short for all securities. If any position is too large, then appropriate corrective action is initiated. Hence, for every trade anywhere in the world, a message is sent to a database at headquarters where the "state" of the firm's position is updated in real time.

The first application maintains essentially no state and is focused on complex patterns, while the second one deals primarily with maintaining state, and performs the same action on every message. Hence, the first is "big pattern – little state" while the second is "little pattern – big state." These two use cases are dealt with by very different software systems. Depending on which use case you have, you should be looking in different directions.

The first use case is in the "sweet spot" for complex event processing (CEP) systems, such as Esper, StreamBase, Tibco, and UC4. Such products offer easy-to-use workflow systems whereby complex patterns can be specified for real-time discovery.

The second use case looks much more like high performance on-line transaction processing (OLTP). In this case, there are three options that are routinely suggested, namely traditional RDBMSs, NoSQL engines, and NewSQL systems. We discuss the three options in turn. Legacy RDBMSs, such as DB2, Oracle, and SQLServer were designed (many years ago) as general purpose DBMSs, appropriate for all applications. As such, their performance on ingesting a firehose of real-time messages will almost certainly be inadequate, both in terms of throughput and latency. People with this use case dismiss the traditional RDBMSs out-of-hand for high velocity applications. Such users often turn to NoSQL solutions for better performance. The 75 or so NoSQL engines attempt to deliver better performance by giving up a high level language (SQL) in favor of low-level record-at-a-time interfaces and by giving up transactions.

Forty years of DBMS research has clearly demonstrated the advantages of high-level data languages (better data independence, better code maintainability, easier to understand code, no loss of performance). Hence, I believe that the NoSQL vendors are misguided in their belief that low level notations yield better performance. It is also curious to note that two of the major NoSQL vendors (MongoDB and Cassandra) are investing in high-level languages that look remarkably like SQL. As such, I tend to think that NoSQL in fact means "Not yet SQL."

The second point of the NoSQL vendors is that giving up ACID transactions yields better performance. This is assuredly true, as noted in [1]. However, there is increasing awareness that ACID consistency is good. Even Google, the bastion of "no ACID", seems to be reconsidering their opposition [2]. Failure to provide ACID means that application designers will have to spend a lot of code ensuring whatever data consistency guarantees they require. The other argument against ACID concerns availability, and stems from the well-known CAP theorem [3], which I discussed in a previous BLOG@CACM posting [4]. Higher availability (especially in a wide-area network environment) will be achieved with eventual consistency of replicas, instead of ACID replicas. However, it should be clearly noted that eventual consistency actually means "creates garbage" when there are non-commutative updates or integrity constraints on the data base. This is discussed further in [2]. Hence, eventual consistency only works for a subset of applications. If you don't have one of these, then the decision to run a non-ACID DBMS is a decision to tear your hair out. As such, the NoSQL vendors offer somewhat better performance and availability in exchange for giving up ACID. If you need ACID, now or in the future, this is likely to be a bad trade.

The third option is to use a NewSQL system. These are available from MemSQL, NuoDB, SolidDB, TimesTen, VoltDB, and others. This class of vendors retains SQL and ACID transactions and offers dramatically better performance than traditional RDBMSs by making very different architectural choices. Often, NewSQL engines are explicitly main memory DBMSs, and most use some other concurrency control scheme than dynamic locking. In addition, some use logical (command) logging rather than traditional data logging, and often rely on a single threaded architecture to avoid the cost of latching shared data structures (such as B-trees). As such, this class of vendors presents an alternative way to obtain way better performance, while retaining ACID and SQL.

In summary, if you can live with high latency (and a lesser fraction of the world will be willing to do so over time), then there are lots of solutions to high velocity processing. Otherwise, you should look to a CEP system or an OLTP system. In the latter case, the traditional RDBMS vendors are a non-starter on high velocity "stateful" applications, and it remains to be seen how the NewSQL and NoSQL vendors will fare.

References

[1] nms.csail.mit.edu/~stavros/pubs/OLTP_sigmod08.pdf

[2] https://www.usenix.org/system/files/conference/osdi12/osdi12-final-16.pdf

[3] Eric Brewer, "Towards Robust Distributed Systems,"
http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf

[4] Michael Stonebraker, "Errors in Database Systems, Eventual Consistency, and the CAP Theorem," BLOG@CACM, April 5, 2010

Disclosure

Michael Stonebraker is an adjunct professor at the Massachusetts Institute of Technology, and is associated with four startups that are either producers or consumers of database technology.