Sunday, April 22, 2012

Bigger Faster Stronger

Scalability is one of those words that everyone uses but few understand. It's a measure of how adding resources (typically hardware) affects performance. You can scale vertically by increasing the power of a server. You can scale horizontally by adding servers. The scalability of a system depends on how performance is defined. Martin Fowler suggests a few categories:
Expect more posts on this one
  • Response time, or the amount of time it takes to process a request
  • Responsiveness, or the amount of time it takes to acknowledge a request
  • Latency, or the amount of time it takes to get a response (this is especially important when there is no data to return)
  • Throughput, such as transactions / second
  • Load, or the amount of stress a system is under
  • Load Sensitivity, or response time / load
  • Efficiency, or performance / resources
  • Capacity, as in maximum throughput or load

Systems must be designed to scale, but what scaling means will depend upon the purposes for which a system is built. It might be tempting for database professionals to think about scalability in terms of transactions / second or the number of active accounts. But what really matters is whether or not the system is usable given an increase in transactions or accounts, and this depends upon the use for which the system was created. If we're talking about an e-Commerce system, throughput is probably more important than response time, as long as responsiveness is high. If we're dealing with a manufacturing system, we'll probably be most interested in throughput.

It's important to design systems to be scalable. The Internet has increased adoption rates to unprecedented rates. Consider Instagram, which has 30 million users after 2 years. Draw Something had 36 million users in three weeks. Scalability is a prerequisite for virality.

In the case of N-tier applications which have a Service-Oriented Architecture, it's usually easy to add hardware to the web and application servers. Load balancers and web farms can take care of extra load by distributing it evenly across a number of servers. The real problem is, as always, the database layer.

You can't just add servers to the database layer, because databases must be architected across multiple database servers. Concurrency adds to the difficulty, as database transactions must be ACID (atomic, consistent, isolated, and durable). In other words, you have to manage updates to multiple servers, making sure that an update to Server 2 does not depend on Server 1.

Lighting bolts make it faster
I thought the Cloud might be the solution to database scalability, but Microsoft Azure currently supports databases of only 150 GB in size. In talking with Microsoft consultants, they recommend 'sharding' databases. This means having a master database that directs transactions to the appropriate database server. For instance, all transactions dealing with North American accounts should go to Server 1, South America to Server 2. Sharding adds a layer of abstraction and a layer of complexity, and it requires duplication of database schema, but it's an increasingly popular approach.

Another option is Oracle's RAC system or Microsoft's MatrixDB, which has basically been ported to Azure. I'm skeptical that MatrixDB will make it in to the next edition of SQL Server (2012 has AlwaysOn, which is close, but the mirrors are read-only). In RAC or MatrixDB, databases are replicated across multiple servers and a load balancer directs reads and writes to the server with the least load. Changes are replicated asynchronously between database servers. Still, there are limitations to the size of databases for which this would be feasible.

Relational databases are great up to a certain size (though this is growing, thanks to SSD's and improved caching). It's hard to say exactly what this size is. In the end, scalable databases adhere to principles of normalization and partitioning. After a certain amount of data, RDMS's will be of no use, and NoSQL solutions are the answer to a different problem. Are you ready to scale?

3 comments:

  1. Zach,

    This is a great post - well done. It's also a subject that is near and dear to my heart. Have you considered MarkLogic to meet your scalability needs? MarkLogic is a fully transactional, ACID-compliant DBMS that scales to the TB and PB levels using commodity hardware and provides sub-second response times on queries.

    Some points of interest...

    Data Model: MarkLogic is based on a document model that underlyingly represents information as XML. MarkLogic doesn't require an upfront data-modeling tasks as would typically be required in traditional (relational) databases because the document *is* the model, thus drastically reducing development cycles.

    Scalability: MarkLogic benefits from a shared-nothing, distributed architecture allowing for ease of horizontal scale (certainly, if you want to beef up the hardware for vertical scale, then that can be done, as well -- you have the flexibility to choose.) Adding or removing a node to a cluster is literally an 5-minute install or uninstall. Really, it’s that easy.

    Sharding: MarkLogic Server utilizes hashing to distribute documents across cluster of servers. Each host in a cluster manages a partition of the database, enabling parallel computation while maximizing IO and Memory utilization of the system. The sharding scheme of MarkLogic offers unbounded growth potential while maintaining sub second response times in the face of growing query and data volumes.

    NoSQL and Big Data: In my opinion, NoSQL really implies "non-relational". This places less emphasis on the query language and increased emphasis on the challenges of relational databases to handle the volume, variety, velocity, and complexity of Big Data. Either way, whether you hate the query language or the paradigm, MarkLogic has been doing NoSQL and Big Data for over 10 years – well before these were trending terms.

    MarkLogic has a significant footprint in the financial services industry. (http://www.marklogic.com/solutions/financial-services/) If you're interested to learn more, I'd be happy to discuss this with you. Coincidentally, the MarkLogic World conference is being held May 1-3 in Washington DC if you're able to make it. (www.marklogicworld.com).

    Best,
    Gary L. Katz

    ReplyDelete
  2. This comment has been removed by a blog administrator.

    ReplyDelete
  3. Zach,

    This is a great post - well done. It's also a subject that is near and dear to my heart. Have you considered MarkLogic to meet your scalability needs? MarkLogic is a fully transactional, ACID-compliant DBMS that scales to the TB and PB levels using commodity hardware and provides sub-second response times on queries.

    Some points of interest...

    Data Model: MarkLogic is based on a document model that underlyingly represents information as XML. MarkLogic doesn't require an upfront data-modeling tasks as would typically be required in traditional (relational) databases because the document *is* the model, thus drastically reducing development cycles.

    Scalability: MarkLogic benefits from a shared-nothing, distributed architecture allowing for ease of horizontal scale (certainly, if you want to beef up the hardware for vertical scale, then that can be done, as well -- you have the flexibility to choose.) Adding or removing a node to a cluster is literally an 5-minute install or uninstall. Really, it’s that easy.

    Sharding: MarkLogic Server utilizes hashing to distribute documents across cluster of servers. Each host in a cluster manages a partition of the database, enabling parallel computation while maximizing IO and Memory utilization of the system. The sharding scheme of MarkLogic offers unbounded growth potential while maintaining sub second response times in the face of growing query and data volumes.

    NoSQL and Big Data: In my opinion, NoSQL really implies "non-relational". This places less emphasis on the query language and increased emphasis on the challenges of relational databases to handle the volume, variety, velocity, and complexity of Big Data. Either way, whether you hate the query language or the paradigm, MarkLogic has been doing NoSQL and Big Data for over 10 years – well before these were trending terms.

    MarkLogic has a significant footprint in the financial services industry. (http://www.marklogic.com/solutions/financial-services/) If you're interested to learn more, I'd be happy to discuss this with you. Coincidentally, the MarkLogic World conference is being held May 1-3 in Washington DC if you're able to make it. (www.marklogicworld.com).

    Best,
    Gary L. Katz

    ReplyDelete

Related Posts Plugin for WordPress, Blogger...