MySQL is obsolete

It’s hard for me to admit, but the landscape of web application technology has recently shifted.  MySQL, and relational databases in general, have started to become obsolete.  The problem is that MySQL doesn’t scale.  Well, it does, but only so far.  A typical story:

Your new database-backed website is starting to become popular, and load times at peak hours are starting to rise.  You add a slave server or two to spread out the reads, and put a lot of the reads in a cache.  That helps, and load times fall.   You can repeat this a few times, but eventually you hit a wall where you have too many writes to handle, no matter how much hardware you have.  Then you’re stuck.  You have to start partitioning your data into separate clusters, but your application was written with a lot of joins, and a lot of stuff no longer works if everything isn’t in the same database.  So, you’ve got a big development effort that has a high possibility of failure.

I suspect that this story would happen to the majority of startups out there if they had dramatic success.  It happened to Friendster, and it cost them global success as the social networking space which they basically invented and completely owned slipped through their fingers.

Relational databases suck for another reason.  If you want any kind of user-defined data, where the user needs to customize the fields gathered, you are faced with either dynamically altering tables (maintenance nightmare, not scalable), or entity-attribute-value tables (hard to query, not scalable).   Scalability is great, but this is the real flaw that’s been hitting me lately.  There are no good solutions to this problem in the relational world.  Google had all these problems years ago with their web indexing database, so they made BigTable, which currently holds their entire 800TB index of the web with metadata in a single table.  That technology now holds almost everything they do.

BigTable is Google-proprietary, unfortunately.  But as of the last year or so, there are reasonably-mature implementations of BigTable and the Google File System that it runs on.  Hadoop is the GFS equivalent, it takes care of managing an arbitrarily-large set of constantly-changing data.  HBase is the BigTable equivalent, providing a REST interface to a simple query language.  These are both very new technologies, but Yahoo is using Hadoop to manage its crawled web data, and HBase is running Powerset, one of the most database-intensive web ideas I can think of.

These are column-oriented databases.  You don’t define columns, you define column families, and a row can have zero to many of each family, each uniquely tagged.  This lets you do away with many-to-many joins, and have infinite capability for user-defined fields that are stored with the main data.  Empty columns are never even stored.

We’re planning to run Hadoop and HBase on a cluster of Amazon EC2 machines.  We’ll use S3 as the disk that the Hadoop cluster reads and writes from.  What this means is that since every system (web, DB, storage) scales with the number of machines in operation, all more traffic will ever mean to us is a bigger bill from Amazon.  And I bet that bill will be lower than the total cost of ownership of the equivalent traditional setup, because we can run fewer servers at night and save money.

So, the next bit of my life will be spent writing a solid database abstration layer between CakePHP and HBase, and an EC2 monitor and dynamic machine allocator.  I hope to open-source these so we can make this a realistic choice for more people out there.


About this entry