the Garden of Forking Paths: Scalability

Showing posts with label Scalability. Show all posts

Sunday, January 27, 2013

The CAP Theorem

Life is full of tradeoffs. I once had an ancient barber (and man of few words) who kept a sign on the wall. It said: "Fast, Good, or Cheap: Pick Two." After many fast, cheap, and bad haircuts, I've found that this aphorism applies not only to coiffuring.

You can find tradeoffs like this anywhere you look. When it comes to data storage, the tradeoff is between consistency, availability, and partition tolerance. According to a famous conjecture by Eric Brewer--and later formally proven by Seth Gilbert and Nancy Lynch--you can only pick two of the three.

In any production environment, you have distributed data sets stored on a number of different nodes. The edge case is one node, but it's most likely (I hope) that you have data distributed across two or more. MS SQL Server, for example, lets you easily set up replication across two nodes in an active-passive model (Microsoft calls this mirroring). If one node goes down, the other takes over. When the first comes back on, it syncs with the second. Amazon's Dynamo replicates over N nodes, providing a highly configurable and fault-tolerant environment. If a node goes down, its data will be replicated to others. But you don't have to go that far. A tape backup could be considered a high-latency node and might be sufficient, depending on your needs.

Before talking about the tradeoffs, let's consider a few definitions.

Consistency refers to the degree to which nodes in a distributed system agree about a data set. In a 100% consistent system, all nodes see the same data at the same instant. Consistency could be measured as a fraction of the number of nodes. Given 5 nodes where 4 agree, you'd have 80% consistency.

Availability refers to the responsiveness of a system. In a 100% available system, all requests will receive responses. Availability is usually called uptime and is measured in units of time. In order to increase availability, you typically increase the number of nodes in your network.

Finally, partition tolerance refers to the ability of the system to handle disruptions in network communication between nodes. A 100% partition tolerant system would be able to work without any network communication. That is, it wouldn't be a distributed system.

From these definitions, you should be able to guess at why you can't have all three. Rather than offering a formal proof, I think it's easiest to see why by considering each combination in turn.

A consistent and available system (CA) will always give you the same answer, and it will always respond to requests. However, it won't be distributed. The token CA system is a single database, whether it be PostgreSQL or Redis. As long as you use some kind of transaction isolation level, you don't have to worry about inconsistencies in your data, and you don't have to build any application logic to shard data to different databases. As your data grows, however, this system may not suffice.

A consistent and partition-tolerant system (CP) is a distributed system that enforces requirements regarding the number of nodes that must be updated in order for a write to be considered successful. HBase and MongoDB do this, for example. If nodes go down, they will replicate data to other nodes. However, because they have consistency guarantees, it is possible that they may refuse reads or writes in the event of a network failure. For example, if you have a three-node system that requires two nodes to agree, and you have a network error in which one node is separated from the other two, that node will not be available. Any clients connecting to it will be denied service.

Finally, an available and partition-tolerant system (AP) replicates data all over the place so it can't possibly guarantee consistency. For example, CouchDB and Riak embrace fault-tolerance over consistency. The idea is that they will be eventually consistent, but they might not be consistent immediately following a write or network outage.

Now, almost any system can be CAP if you're willing to wait. The tradeoff only exists when we're talking about a single instant. If a system takes only a few ms to become consistent, this may be sufficient. The Internet's DNS will eventually be consistent, but there are no guarantees about how long this will take. Riak lets you tweak the number of nodes that must agree before a read or write is considered successful. Still, in a sufficiently large system, you may not be able to wait that long.

Diving into the NoSQL universe may be daunting, but a big part is determining where you fall in the CAP spectrum. Even if you're not considering a NoSQL database, I think it's useful to understand these tradeoffs. Are there any other tradeoffs facing distributed data solutions?

Sunday, September 9, 2012

Mapreduce and Key-Value Systems

I've been playing around with Riak (ree-ack), my first foray into NoSQL, and it's been a lot of fun. Riak is an open source implementation of Amazon's DynamoDB. It's a web-ready key-value store that scales easily. And it's a whole different world from RDBMS's. Basho has a great tutorial for getting started.

If you're a database guy like me, you're probably wondering what you can do with a key-value store. Using key-value tables is a SQL anti-pattern. Even if you get performance and scalability (which I'll discuss in an upcoming post), it's hard to see how you could do any interesting queries with a key-value system.

One thing to note is that a file system is basically a key-value store. The key is the file name and the value is the contents of the file. You can do searches on file systems, but you can't really do much querying. If your data needs are more like a file system than complex analytics, a key-value store might be right for you.

I also finally learned what the MapReduce algorithmic framework does. You may have heard of MapReduce (amongst many other new technologies) in connection with Hadoop. MapReduce will probably go down in history as one of Google's greatest contributions to computer science (along with the PageRank algorithm).

MapReduce is inspired by a functional language, LISP. The idea is simple: take the algorithm to the data. In a highly distributed and scalable system, it's not feasible to move data to a processor in order to aggregate and slice it. You have to do any processing in-place. The way to do this is to map your keys to some kind of broader category and then run a reduce algorithm over the mapped data to pick out only the information you need.

For example, let's say your key-value store contains documents, and you want to count the number of documents beginning with the letter 'X' that contain the word 'tibialoconcupiscent'. In this case, you would map all documents with a key like 'x*', such as 'xerophagy.pdf'. Your reduction algorithm would simply return 1 for every case of the word and 0 otherwise. This reduction could be run on each server, a group of servers, and then all servers, returning a single (probably small) number.

This framework may seem limited, and it is. Key-value systems just aren't great for complex querying. One reason SQL has been so popular is that it allows for unlimited combinations of queries. You're limited in the ways you store data, and you have to wait for the system to write the data to disk in order to ensure consistency, but you can read the data in lots of interesting ways. That's why it's called a Structured Query Language.

(Riak does let you do some more interesting things because you can create links between different keys that define any kind of relationship between them. These links are just metadata. You can MapReduce across links, but the basic idea is the same.)

Now, where would I use Riak? There are a good number of production users already. I would think it would be best used in systems that don't require a lot of complex querying or complex data types. For instance, it could store webpages, messages, or other content. Unless you have a huge amount of data or require very fast writes, it's probably not necessary. But if you're looking to grow fast, it might be the right choice.

Sunday, April 22, 2012

Bigger Faster Stronger

Scalability is one of those words that everyone uses but few understand. It's a measure of how adding resources (typically hardware) affects performance. You can scale vertically by increasing the power of a server. You can scale horizontally by adding servers. The scalability of a system depends on how performance is defined. Martin Fowler suggests a few categories:

Expect more posts on this one

Response time, or the amount of time it takes to process a request
Responsiveness, or the amount of time it takes to acknowledge a request
Latency, or the amount of time it takes to get a response (this is especially important when there is no data to return)
Throughput, such as transactions / second
Load, or the amount of stress a system is under
Load Sensitivity, or response time / load
Efficiency, or performance / resources
Capacity, as in maximum throughput or load

Systems must be designed to scale, but what scaling means will depend upon the purposes for which a system is built. It might be tempting for database professionals to think about scalability in terms of transactions / second or the number of active accounts. But what really matters is whether or not the system is usable given an increase in transactions or accounts, and this depends upon the use for which the system was created. If we're talking about an e-Commerce system, throughput is probably more important than response time, as long as responsiveness is high. If we're dealing with a manufacturing system, we'll probably be most interested in throughput.

It's important to design systems to be scalable. The Internet has increased adoption rates to unprecedented rates. Consider Instagram, which has 30 million users after 2 years. Draw Something had 36 million users in three weeks. Scalability is a prerequisite for virality.

In the case of N-tier applications which have a Service-Oriented Architecture, it's usually easy to add hardware to the web and application servers. Load balancers and web farms can take care of extra load by distributing it evenly across a number of servers. The real problem is, as always, the database layer.

You can't just add servers to the database layer, because databases must be architected across multiple database servers. Concurrency adds to the difficulty, as database transactions must be ACID (atomic, consistent, isolated, and durable). In other words, you have to manage updates to multiple servers, making sure that an update to Server 2 does not depend on Server 1.

Lighting bolts make it faster

I thought the Cloud might be the solution to database scalability, but Microsoft Azure currently supports databases of only 150 GB in size. In talking with Microsoft consultants, they recommend 'sharding' databases. This means having a master database that directs transactions to the appropriate database server. For instance, all transactions dealing with North American accounts should go to Server 1, South America to Server 2. Sharding adds a layer of abstraction and a layer of complexity, and it requires duplication of database schema, but it's an increasingly popular approach.

Another option is Oracle's RAC system or Microsoft's MatrixDB, which has basically been ported to Azure. I'm skeptical that MatrixDB will make it in to the next edition of SQL Server (2012 has AlwaysOn, which is close, but the mirrors are read-only). In RAC or MatrixDB, databases are replicated across multiple servers and a load balancer directs reads and writes to the server with the least load. Changes are replicated asynchronously between database servers. Still, there are limitations to the size of databases for which this would be feasible.

Relational databases are great up to a certain size (though this is growing, thanks to SSD's and improved caching). It's hard to say exactly what this size is. In the end, scalable databases adhere to principles of normalization and partitioning. After a certain amount of data, RDMS's will be of no use, and NoSQL solutions are the answer to a different problem. Are you ready to scale?