My Top 10 Assertions About Data Warehouses

MIT Adjunct Professor Michael Stonebraker

Data warehouses, business intelligence, business analytics, and complex analytics are the subject of increasingly intense marketing wars, seemingly accelerated by Oracle’s recent introduction of the Exadata appliance. Here is my spin on the situation, organized as my top 10 assertions about this topic. Please note that I have a financial interest in several database companies, and may be biased in a number of different ways. The reader should always keep this in mind.

Assertion 1: Star and snowflake schemas are a good idea in the data warehouse world.

In short, data warehouses store a large collection of facts. The overwhelming majority of these facts are the "5 Ws" (who, what, where, when, why) along with a collection of attributes about the fact. For example, a typical retail organization stores facts about historical transactions. These facts include the customer (who), the retail store (where), the time of the sale (when), and the purchased product (what), along with attributes of the sale (e.g., price, sales tax, credit card used, etc.).

One should organize such data as shown in Figure 1:

                Customer (c-key, c-attributes)           Time (t-key, t-attributes)
                                               |                                                     |
                                               |       |
                                               Fact (c-key, s-key, t-key, p-key, attributes)
                                               |                                           |
                                               |                                             |
                Store (s-key, s-attributes)    Product (p-key, p-attributes)

A Diagram of a Star Schema
Figure 1

Such a schema is called a star schema, with a central Fact table and surrounding Dimension tables. If stores are organized into divisions, then the star schema has another table between Store and Fact, and becomes a snowflake schema. Star and Snowflake schemas are clean, simple, easy to parallelize, and usually result in very high-performance database management system (DBMS) applications.

If you are a data warehouse designer and come up with something other than a snowflake schema, you should probably rethink your design.

However, you will often come up with a design having a large number of attributes in the fact table; 40 attributes are routine and 200 are not uncommon. Current data warehouse administrators usually stand on their heads to make "fat" fact tables perform on current relational database management systems (RDBMSs). This leads to my second assertion.

Assertion 2: Column stores will dominate the data warehouse market over time, replacing row stores.

The problem with row stores is they store data in the fact table, row by row, on disk. A typical business intelligence query requires half-a-dozen attributes or less (e.g., find me the average price of widgets by store by month for the past two years). A row store will read all 200 attributes, even though only six are required. In contrast, a DBMS that organized data by column will read only the six required, a savings of a factor of 33.

Since fact tables are getting fatter over time as business analysts want access to more and more information, this architectural difference will become increasingly significant. Even when "skinny" fact tables occur or where many attributes are read, a column store is still likely to be advantageous because of its superior compression ability.

For these reasons, over time, column stores will clearly win. Note that almost all traditional RDBMSs are row stores, including Oracle, SQLServer, Postgres, MySQL, and DB2. Also, systems that implement PAX-style coding (type "Anastassia Ailamaki" into your Web browser for a substantial bibliography on this topic) on individual blocks, such as Exadata, are just changing the internal layout of a block, and do not alter the column store advantage noted above.

Assertion 3: The vast majority of data warehouses are not candidates for main memory or flash memory.

My experience is that data warehouses are increasing in size faster than storage is getting cheaper. Business analysts have an appetite for as much attribute data as they can get their hands on, and want to keep increasingly long periods of history. Hence, data warehouse problems are getting "net harder," not "net easier." Put differently, most data warehouses are measured in Gbytes today, Tbytes tomorrow, and Pbytes the next day.

It will take a long time before main memory or flash memory becomes cheap enough to handle most warehouse problems. As such, non-disk technology should only be considered for temporary tables, very "hot" data elements, or very small data warehouses.

Assertion 4: Massively parallel processor (MPP) systems will be omnipresent in this market.

Massively parallel processor systems (10’s now going on 100’s tomorrow) are the only kind of computer architecture that will scale to Pbytes. All vendors, with a very few exceptions, are or will soon support MPP. Don’t bet on anything that is not in the MPP camp.

Assertion 5: "No knobs" is the only thing that makes any sense.

It is pretty clear that human operational costs dominate the cost of running a data warehouse. This is mainly the system administration and database administration that is involved in keeping an MPP system up and in managing a Pbyte-sized warehouse. Database administrator (DBA) costs include designing schemas, reprovisioning databases to add or drop resources, adding and dropping users, tuning and retuning the DBMS, etc.

Almost all DBMSs have 100 or more complicated tuning "knobs." This requires DBAs to be "4-star wizards" and drives up operating costs. Obviously, the only thing that makes sense is to have a program that adjusts these knobs automatically. In other words, look for "no knobs" as the only way to cut down DBA costs.

Assertion 6: Appliances should be "software only."

In my 40 years of experience as a computer science professional in the DBMS field, I have yet to see a specialized hardware architecture—a so-called database machine—that wins.

In other words, one can buy general purpose CPU cycles from the major chip vendors or specialized CPU cycles from a database machine vendor. Since the volume of the general purpose vendors are 10,000 or 100,000 times the volume of the specialized vendors, their prices are an order of magnitude under those of the specialized vendor. To be a price-performance winner, the specialized vendor must be at least a factor of 20−30 faster.

I have never seen a specialized hardware architecture that is faster by this amount. However, every decade several vendors try (and fail).

Put differently, I think database appliances are a packaging exercise—i.e., preconfigure general purpose hardware and preload the DBMS on it. This results in a software-only appliance.

Assertion 7: Hybrid workloads are not optimized by "one-size fits all."

If one has a workload that is part online transaction processing (OLTP) and part data warehouse, then he or she has two options:

1) Run a general purpose RDBMS that stores both kinds of data;

2) Run two systems, an OLTP engine and a data warehouse engine, coupled together with a high-speed interchange to move operational data into the data warehouse.

Row stores are not good at data warehouse applications (see assertion 1). Column stores are optimized for data warehouses and are not good at OLTP. Hence, neither is a good candidate for a "one size fits all" implementation. Instead, there are a number of interesting new ideas to accelerate OLTP, including main-memory SQL engines, main memory caches, flash systems, etc. When coupled with a column store in a two-system configuration, I assert the result will be a factor of 50 or so faster than solution 1. Put differently, two specialized systems can each be a factor of 50 faster than the single "one size fits all" system in solution 1.

A factor of 50 is nothing to sneeze at.

Assertion 8: Essentially all data warehouse installations want high availability (HA).

As noted above, data warehouses are getting very big. If there is data corruption in a 10 Tbyte warehouse, then recovering the database from the database log will take a very long time. Of course, the exact amount of time depends on log formats, number of disks allocated to the log, etc. However, restoring a 10 Tbyte dump from 10 disks is likely to take hours, not minutes, and then one has to perform a roll forward from the dump. Hardly anybody is interested in taking this kind of down time.

Instead, most installations use replication and fail over to a replica (Tandem computer style) if there is data corruption. Then, the corrupted copy can be rebuilt as a background task. Hence, HA is used for recovery, not the DBMS log. Obviously, this requires the DBMS to support HA; otherwise, it is a manual DBA hassle to accomplish the same thing in user code. Moreover, there is no reason to write a DBMS log if this is going to be the recovery tactic. As such, a source of run-time overhead can be avoided.

Assertion 9: DBMSs should support online reprovisioning.

Not always, but often, I hear a request for online reprovisioning. In other words, one initially allocates 10 nodes to accomplish warehouse processing. The load later rises, and the desire is to allocate 20 nodes to the task originally done by 10. This requires the database to be repartitioned over double the number of nodes.

Hardly anybody wants to take the required amount of down time to dump and reload the DBMS. Likewise, it is a DBA hassle to do so. A much better solution is for the DBMS to support reprovisioning, without going offline. Few systems have this capability today, but vendors should be encouraged to move quickly to provide this feature.

Assertion 10: Virtualization often has performance problems in a DBMS world.

I hear many users say their long-term goal is to move to the cloud, whether using the public cloud (say, EC2) or inside the firewall on an "enterprise cloud." Here, a collection of servers is allotted to several-to-many DBMS applications inside an enterprise firewall. Often, such systems are managed by using virtualization software to present virtual nodes to the DBMS and its applications.

My experience is that CPU resources can be virtualized with modest overhead (say, 10%). However, data warehouses entail disk-based data. In this world, all MPP DBMSs "move the query to the data," (i.e., they send constituent pieces of a composite query to various nodes for local processing). Obviously, this requires knowing the physical data distribution. Virtualization will destroy this knowledge, and turn what were originally reads to a local disk into reads to non-local disks. In other words, local I/O gets replaced by remote I/O, with an obvious significant performance hit.

Until better and cheaper networking makes remote I/O as fast as local I/O at a reasonable cost, one should be very careful about virtualizing DBMS software. Note clearly that data warehouse DBMSs are reading disks allocated to them at, say, half the disk bandwidth. Hence, a 10 disk data warehouse application will require multi-gigabit bandwidth. Now, put 10 such applications on an enterprise cloud, and you are approaching terabit speeds. Reasonable cost networking at this speed is still a ways off.

Of course, the benefits of a virtualized environment are not insignificant, and they may outweigh the performance hit. My only point is to note that virtualizing I/O is not cheap.

The same general comment also applies to storage area network (SAN) installations. If an MPP DBMS runs against SAN storage, then local I/O is replaced by remote I/O, over whatever interconnect the SAN vendor supports. Again, the cost may not be insignificant.

Disclosure: In addition to being an adjunct professor at the Massachusetts Institute of Technology, Michael Stonebraker is associated with four startups that are either producers or consumers of database technology.