Sunday, October 14, 2012

HBase: A Column-Family Store

The next NoSQL database in my whirlwind tour is Apache HBase, a column-oriented database. Riak was a simple key-value database, which mapped an object it knew nothing about to a bucket/key combination. Its job was just to make sure the data was replicated consistently in a highly fault-tolerant manner. HBase schemas, on the other hand, are made up of tables, which are dictionaries of dictionaries. It gives you more granularity and schema control, but it's still dumb about the values of its key-value pairs.

For SQL developers, HBase terminology will not make much sense at first. The top-level objects are tables. A table is a set of key-value pairs in which each key is a row key and each value is a row made up of column families. The column families are, in turn, collections of key-value pairs. A table is not a relation in the SQL sense, nor are there any guarantees that one row will look much like another.
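To make the "dictionaries of dictionaries" idea concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the HBase client API: the helper names (new_table, put, get) and the sample rows and families are made up for illustration, and real HBase also keeps timestamped versions of each cell, which I leave out.

```python
from collections import defaultdict

# Toy model of an HBase-style table (NOT the HBase API).
# Layout: table[row_key][column_family][column] -> opaque value
def new_table():
    return defaultdict(lambda: defaultdict(dict))

def put(table, row_key, family, column, value):
    """Store one cell; rows and families spring into existence on write."""
    table[row_key][family][column] = value

def get(table, row_key, family, column):
    """Return one cell, or None if the row, family, or column is absent."""
    return table.get(row_key, {}).get(family, {}).get(column)

# Hypothetical data: two rows that do not share the same shape.
users = new_table()
put(users, "row1", "info", "name", b"Ada")
put(users, "row1", "info", "email", b"ada@example.com")
put(users, "row2", "info", "name", b"Grace")
put(users, "row2", "prefs", "theme", b"dark")

print(get(users, "row1", "info", "email"))
print(get(users, "row2", "info", "email"))
```

The second lookup comes back empty because row2 simply has no email column. Nothing forces the two rows to look alike, and the values are just bytes the store knows nothing about.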

[Figure: An HBase table]

The value of this system is that you get another layer of granularity over a key-value store like Riak. You have a bunch of stuff (a table of rows), each piece of which has a couple of broad characteristics (a row of column families). Listen, these NoSQL database systems are not all that different from each other. Which one is right for you will depend upon 1) your other applications, 2) your size needs, 3) your performance needs, and 4) your fault-tolerance needs.

For example, HBase favors consistency: a write is not acknowledged until it has been durably logged, and readers never see a stale copy of a row. Riak can be tweaked so that writes are successful when only one node has been written to, so it should be faster (though it's hard to find any data on these things, so you'll have to do the POC yourself). HBase is best when you have a lot of data and need to run MapReduce. Riak is right if uptime is what matters most to you, on the order of 99.999%.
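The intuition behind the Riak claim can be sketched with a toy model. Riak calls the number of replicas that must acknowledge a write w; the function name and the latency numbers below are made up, and this is back-of-the-envelope arithmetic, not a benchmark of either system.

```python
def write_latency_ms(replica_latencies_ms, w):
    """Time until w replicas have acknowledged: the w-th fastest ack."""
    if not 1 <= w <= len(replica_latencies_ms):
        raise ValueError("w must be between 1 and the number of replicas")
    return sorted(replica_latencies_ms)[w - 1]

# Hypothetical acknowledgment times for three replicas, in milliseconds.
latencies = [40, 12, 95]

one_ack = write_latency_ms(latencies, w=1)   # return after the fastest replica
all_acks = write_latency_ms(latencies, w=3)  # wait for the slowest replica

print(one_ack, all_acks)
```

With w=1 the client waits only for the fastest replica; with w equal to the replica count it waits for the slowest one. The trade-off is that a low w gives you less assurance that the write has reached every copy.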

For me personally, HBase was a pain in the ass to set up. I'm not a Linux pro, and HBase isn't really meant to be run in a single instance on a crappy laptop, but this was a real turnoff. MongoDB, Riak, and PostgreSQL were all pieces of cake to set up. So why do it?

One reason is that HBase is part of an ecosystem of tried-and-tested enterprise applications, like the Hadoop Distributed File System and Hive, a data warehouse with its own SQL-like query language. Its users include Facebook, Twitter, and Yahoo! (Google uses its own Bigtable, the system HBase was modeled on.) There's a large community of users to go to with questions.

Still, HBase has the flavor of many open-source systems that grew out of the need to scale fast on commodity hardware. I'll be interested to see what Microsoft does with Hadoop and HBase. Call me a Microsoft fan-boy, but I think their integration of technologies is often very good, and it takes time to work out the kinks in a new platform. The first iPhone was a piece of junk. It takes a few iterations for a technology to mature and solve the problems it's designed to solve.

Links:
- Check out this nice comparison of Cassandra, MongoDB, CouchDB, Redis, Riak, HBase, Membase, and Neo4j
