The Garden of Forking Paths: Riak

Life is full of tradeoffs. I once had an ancient barber (and man of few words) who kept a sign on the wall. It said: "Fast, Good, or Cheap: Pick Two." After many fast, cheap, and bad haircuts, I've found that this aphorism applies not only to coiffuring.

You can find tradeoffs like this anywhere you look. When it comes to data storage, the tradeoff is between consistency, availability, and partition tolerance. According to a famous conjecture by Eric Brewer, later formally proven by Seth Gilbert and Nancy Lynch, you can only pick two of the three.

In any production environment, you have distributed data sets stored on a number of different nodes. The edge case is one node, but it's most likely (I hope) that you have data distributed across two or more. MS SQL Server, for example, lets you easily set up replication across two nodes in an active-passive model (Microsoft calls this mirroring). If one node goes down, the other takes over. When the first comes back on, it syncs with the second. Amazon's Dynamo replicates over N nodes, providing a highly configurable and fault-tolerant environment. If a node goes down, its data will be replicated to others. But you don't have to go that far. A tape backup could be considered a high-latency node and might be sufficient, depending on your needs.

Before talking about the tradeoffs, let's consider a few definitions.

Consistency refers to the degree to which nodes in a distributed system agree about a data set. In a 100% consistent system, all nodes see the same data at the same instant. Consistency could be measured as a fraction of the number of nodes. Given 5 nodes where 4 agree, you'd have 80% consistency.
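The fraction described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how any real database reports consistency; the function name `consistency_fraction` and the version labels are made up for the example.

```python
from collections import Counter

def consistency_fraction(node_values):
    """Return the fraction of nodes that hold the most common value."""
    if not node_values:
        raise ValueError("need at least one node")
    # most_common(1) yields [(value, count)] for the value held by the most nodes.
    _, count = Counter(node_values).most_common(1)[0]
    return count / len(node_values)

# Five nodes: four agree on "v2", one is still serving the stale "v1".
print(consistency_fraction(["v2", "v2", "v2", "v2", "v1"]))  # 0.8
```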

Availability refers to the responsiveness of a system. In a 100% available system, all requests will receive responses. Availability is usually reported as uptime: the share of time during which the system responds. In order to increase availability, you typically increase the number of nodes in your network.
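Uptime targets are often quoted as percentages, and it can help to turn one into a downtime budget. The sketch below does that conversion; the helper name `allowed_downtime_minutes` is hypothetical, and the year is assumed to have 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored for simplicity

def allowed_downtime_minutes(uptime_percent):
    """Minutes per year a system may be down and still meet the uptime target."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)

for pct in (99.0, 99.9, 99.999):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

At 99.999%, that budget works out to a little over five minutes a year.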

Finally, partition tolerance refers to the ability of the system to handle disruptions in network communication between nodes. A 100% partition-tolerant system would be able to work without any network communication. That is, it wouldn't be a distributed system.

From these definitions, you should be able to guess at why you can't have all three. Rather than offering a formal proof, I think it's easiest to see why by considering each combination in turn.

A consistent and available system (CA) will always give you the same answer, and it will always respond to requests. However, it won't be distributed. The archetypal CA system is a single database, whether it be PostgreSQL or Redis. As long as you use a suitable transaction isolation level, you don't have to worry about inconsistencies in your data, and you don't have to build any application logic to shard data to different databases. As your data grows, however, this system may not suffice.

A consistent and partition-tolerant system (CP) is a distributed system that enforces requirements regarding the number of nodes that must be updated in order for a write to be considered successful. HBase and MongoDB do this, for example. If nodes go down, they will replicate data to other nodes. However, because they have consistency guarantees, it is possible that they may refuse reads or writes in the event of a network failure. For example, if you have a three-node system that requires two nodes to agree, and you have a network error in which one node is separated from the other two, that node will not be available. Any clients connecting to it will be denied service.
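The three-node example can be modeled with a toy function. This is a sketch of the quorum idea only, not of how HBase or MongoDB actually implement it; the node names and the `serving_nodes` helper are invented for illustration.

```python
def serving_nodes(partitions, quorum):
    """Return the nodes that can still reach at least `quorum` nodes, counting themselves."""
    serving = set()
    for group in partitions:
        # Nodes in the same group can talk to each other but to no one else.
        if len(group) >= quorum:
            serving.update(group)
    return serving

# A network error separates node C from nodes A and B.
partitions = [{"A", "B"}, {"C"}]

print(sorted(serving_nodes(partitions, quorum=2)))  # ['A', 'B']
```

Node C sits alone on its side of the partition, cannot reach a second node, and so has to turn its clients away.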

Finally, an available and partition-tolerant system (AP) replicates data all over the place so it can't possibly guarantee consistency. For example, CouchDB and Riak embrace fault-tolerance over consistency. The idea is that they will be eventually consistent, but they might not be consistent immediately following a write or network outage.

Now, almost any system can offer all three of C, A, and P if you're willing to wait. The tradeoff only exists when we're talking about a single instant. If a system takes only a few milliseconds to become consistent, this may be sufficient. The Internet's DNS will eventually be consistent, but there are no guarantees about how long this will take. Riak lets you tweak the number of nodes that must agree before a read or write is considered successful. Still, in a sufficiently large system, you may not be able to wait that long.
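To see what tuning those numbers buys you, assume the usual Dynamo-style setup: N replicas per key, W replicas that must acknowledge a write, and R replicas that must answer a read. The brute-force check below is plain Python, not Riak client code, and `always_overlap` is a made-up name. It tests whether every possible read set shares at least one replica with every possible write set, which is what lets a read see the latest write.

```python
from itertools import combinations

def always_overlap(n, r, w):
    """True if every set of r read replicas shares a node with every set of w written replicas."""
    nodes = range(n)
    return all(
        set(reads) & set(writes)  # an empty intersection is falsy
        for writes in combinations(nodes, w)
        for reads in combinations(nodes, r)
    )

for r, w in [(1, 1), (2, 2), (1, 3)]:
    print(f"N=3 R={r} W={w}: overlap guaranteed = {always_overlap(3, r, w)}")
```

The overlap is guaranteed exactly when R + W > N. Low settings such as R=1, W=1 respond after hearing from fewer nodes, but a read may land on a replica the write has not reached yet.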

Diving into the NoSQL universe may be daunting, but a big part of it is determining where you fall in the CAP spectrum. Even if you're not considering a NoSQL database, I think it's useful to understand these tradeoffs. Are there any other tradeoffs facing distributed data solutions?

The next NoSQL database in my whirlwind tour is Apache HBase, a column-oriented database. Riak was a simple key-value database, which mapped an object it knew nothing about to a bucket/key combination. Its job was just to make sure the data was replicated consistently in a highly fault-tolerant manner. HBase schemas, on the other hand, are made up of tables, which are dictionaries of dictionaries. It gives you more granularity and schema control, but it's still dumb about the values of its key-value pairs.

For SQL developers, HBase terminology will not make much sense at first. The large objects are tables, which are made up of key-value pairs consisting of row keys and column families. The column families are, in turn, collections of key-value pairs. A table is not a group of relations, nor are there any guarantees that one row will look much like another.

Figure: An HBase table
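The "dictionaries of dictionaries" shape is easy to mimic in Python. The sketch below is only a mental model: the row keys, family names, and the `get_cell` helper are all made up, and it leaves out parts of real HBase such as cell timestamps.

```python
# Row key -> column family -> column -> value.
users = {
    "user:1001": {
        "info": {"name": "Ada", "email": "ada@example.com"},
        "prefs": {"theme": "dark"},
    },
    "user:1002": {
        "info": {"name": "Grace"},  # no email, no prefs: rows need not look alike
    },
}

def get_cell(table, row_key, family, column, default=None):
    """Look up one value, returning `default` if any level is missing."""
    return table.get(row_key, {}).get(family, {}).get(column, default)

print(get_cell(users, "user:1001", "info", "email"))   # ada@example.com
print(get_cell(users, "user:1002", "prefs", "theme"))  # None
```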

The value of this system is that you get another layer of granularity over a key-value store like Riak. You have a bunch of stuff (a table of rows), which has a couple of broad characteristics (a row of column families). Listen, these NoSQL database systems are not all that different from each other. Which one is right for you will depend upon 1) your other applications, 2) your size needs, 3) your performance needs, and 4) your fault-tolerance needs.

For example, HBase will not serve a read of new data until a client has succeeded in writing it to all replicated instances. Riak can be tweaked so that writes are successful when only one node has been written to, so it should be faster (though it's hard to find any data on these things, so you'll have to do the proof of concept yourself). HBase is best when you have a lot of data and need to run MapReduce. Riak is right if you need 99.999% uptime.

For me personally, HBase was a pain in the ass to set up. I'm not a Linux pro, and HBase isn't really meant to be run in a single instance on a crappy laptop, but this was a real turnoff. MongoDB, Riak, and PostgreSQL were all pieces of cake to set up. So why do it?

One reason is that HBase is part of an ecosystem of tried-and-tested enterprise applications, like the Hadoop Distributed File System and Hive, which adds a SQL-like query language. Its users include Facebook, Twitter, and Yahoo! (Google uses its own BigTable.) There's a large community of users to go to with questions.

Still, HBase has the flavor of many open source systems that grew out of the need to scale fast on commodity hardware. I'll be interested to see what Microsoft does with Hadoop and HBase. Call me a Microsoft fan-boy, but I think their integration of technologies is often very good, and it takes time to work out the kinks in a new platform. The first iPhone was a piece of junk. It takes a few iterations for a technology to mature and solve the problems it's designed to solve.

Links:
-Check out this nice comparison of Cassandra, MongoDB, CouchDB, Redis, Riak, HBase, Membase, and Neo4j

Sunday, January 27, 2013

The CAP Theorem

Sunday, October 14, 2012

HBase: A Column-Family Store