As an analyst, I cut my teeth writing about database. In the 1990s the database market was hugely active with a mass of products; Adabas, Allbase,  Datacomm DB, DB2, Empress, IDMS,   Informix, Ingres, Interbase, Oracle, Progress, RDB, Sybase, Unify, and quite a few more. Object databases were introduced in 1993 or so, and then the object relational database emerged. Microsoft got into the game by licensing the Sybase code. Database benchmarking contests consumed column inches of IT news and then Very Large Database (VLDB) became a focus and Teradata, Red Brick and Tandem Non-stop SQL got air-time.

It was a very active market and then, all of a sudden, it wasn’t. Some products atrophied, companies ate other companies and eventually the marketing contest was decided. Three very big players: Oracle, IBM (DB2) and Microsoft (MS SQL Server) became dominant and most other products scratched out a living if they could. That’s what happens with lively software markets, eventually they settle down with a few large vendors sharing the spoils. The only development that disturbed the comfortable hegemony of the big three was the gradual success of MySQL, which was Open Source and optimized for serving web pages. It dominated a small market that soon became a big market.

Vertica and Verticality

If you put specialist database applications to one side, you can view the database world as consisting of 3 big markets:

  1. Commodity Database: For undemanding database applications pretty much any viable rdbms will do. May as well stick with Oracle, DB2 or SQL Server, whichever your company uses most.
  2. Large OLTP Databases: These are for very heavy transaction workloads. The big 3 are used here, but also some older products, IDMS and Datacomm for example, run big OLTP systems.
  3. Data Warehouse: These databases hold masses of data (in the terabyte area). Oracle, DB2 and SQL Server are used in such applications but so are more specialist products including Teradata, Green Plum, NeoView, Netezza and one or two others.

As far as Data Warehouse is concerned, there is now a new kid in town, that goes by the name of Vertica. The team that created Vertica is led by the man who can claim to be the most prolific database author in history, Prof Michael Stonebraker. The list of databases he has worked on  includes; Ingres, Postgres, Illustra, Cohera, StreamBase and now Vertica. When something new comes from the Stonebraker stable it is usually worth looking at - and Vertica is definitely worth looking at.

The simplest way to explain why Vertica is compelling is to describe the thinking behind it.

  • First of all, the vast majority of data warehouses are built using a simple snowflake schema. The important point about such a database is that the vast majority of the data (95% +) is held in a single “fact table”. The fact table will hold the normal transactions of the business. For banking that would be account transactions, for a telco it’s telephone calls, for a retailer its individual items purchased and so on. Nearly all the transactions run against the data warehosue will involve pulling data from the fact table.
  • Vertica is a database that focuses almost entirely on pulling data from very very big tables. It optimizes queries that do precisely that. It does so by organizing the data into columns and processing on a column by column basis. I could explain why this is dramatically faster than the way relational databases typically process database tables, but there is not the space here. Take my word for it, it flies like a bullet from a gun. Depending on circumstances, it can be 100 times faster.
  • There are products like those I mentioned (Teradata, NeoView, etc.) that do the data warehouse job faster than Oracle/DB2/SQL Server, but not 100 times faster. 100 times faster is “two orders of magnitude” and that is huge. It means that a query that previously took 3 hours, now takes about 100 seconds. For data warehouse that’s game changing.

Vertica Characteristics

In actual fact, there’s nothing that Vertica is doing that is new in terms of database techniques. It uses column processing (not new), it compresses columns (not new), it is highly parallel (not new), it is optimized for a particular workload (not new), it’s shared nothing (not new), it works on commodity hardware (not new), it is entirely self-tuning (not new). However, when you put all of this not-newness together you get data warehouse characteristics that are distinctly new. Here’s a list:

  1. Requires no DBAs
  2. Self scaling (to scale up, it kinda scales out)
  3. Always on (loading is concurrent with processing)
  4. Near real-time data (indeed so, if compared to anything else)
  5. Configurable to be fully fault tolerant
  6. Very much faster
  7. Considerably reduced cost
  8. Can work in the cloud

The last of these characteristics is doubly worthy of comment and it is the most exraordinary aspect of Vertica. I have to confess that, although I eventually expected data warehouse to be offered as a cloud-based capability, I never expected it any time soon. Vertica is partnering with Amazon to deliver it now - at a price of $2000 per month for half a terabyte of data.

If you know anything about the costs of a data warehouse then you will think I’ve misquoted the price, but I haven’t. You really can commence building a highly scalable always available data warehouse and have it up and running in a day or two (if you’ve got the data ready) for a few thousand dollars.

If that isn’t revolutionary, then what is?

  Subscribe to HaveMacWillBlog in a reader