17 Aug, 2009
A Compendium of solutions for scaling a Data Store
Posted by Bhavin Turakhia
Achieving infinite scalability on your data store is the holy grail of scaling an application. App servers are typically stateless and therefore a cinch to scale. This document serves as a comprehensive compendium on my thoughts and research on scaling a data store.
Requirements
- Infinite Scalability
- High Availability (0% downtime)
- Data Redundancy
- High Performance
- Storage Flexibility – ability to store any type of data
- Query Flexibility – ability to perform simple gets, range based gets, range based updates, and possibly complex joins
Features that a solution must have to deliver the above Requirements
- Replication – Each unit of data should be copied to multiple nodes so that if an underlying node crashes there is no data loss
- Partitioning – Data should be divided across multiple nodes based on specific keys so that the data layer is infinitely scalable
- Online node addition – Solution should support adding new nodes online, with automated data distribution upon new node addition
- Load Balancing – Queries should be load balanced between nodes
- Persistence – Should have a data persistence layer so that data is not volatile
- Caching – Should support flexible in-memory caching for increased retrieval speeds
- Tree based Indexing – To support range based queries, one ideally needs tree type indexes on the keys against which range queries may be made
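As a rough illustration of how the Replication, Partitioning and Online node addition features fit together, here is a minimal consistent-hashing sketch in Python. It is not taken from any of the products discussed in this post; the `HashRing` class, the `vnodes` and `replicas` parameters and the node names are all hypothetical and chosen only for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a stable integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring (an assumption for illustration)."""

    def __init__(self, nodes=(), vnodes=64, replicas=2):
        self.vnodes = vnodes      # virtual points placed per physical node
        self.replicas = replicas  # number of copies kept of each key
        self._ring = []           # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Add a node online; only keys next to its points change owner."""
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def nodes_for(self, key: str) -> list:
        """Return the distinct nodes that should hold a copy of `key`."""
        distinct = {node for _, node in self._ring}
        wanted = min(self.replicas, len(distinct))
        start = bisect.bisect(self._ring, (_hash(key), ""))
        owners = []
        # Walk clockwise from the key's position, skipping repeated nodes.
        for step in range(len(self._ring)):
            node = self._ring[(start + step) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == wanted:
                    break
        return owners


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.nodes_for("user:42"))  # two distinct nodes: primary and replica
```

The first node returned acts as the primary for the key and the rest hold replicas. When a node is added, the only keys whose primary changes are those that now map to the new node, which is what makes automated redistribution on node addition cheap.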
Proposed Solutions
The solutions below are the result of researching a ton of options (refer to the Research section below). They also represent solutions that are practical to deploy and are being used in production. There are exotic variants that I came up with during my research, but I have left them out.
Solution 1: Google App Engine
- Provides BigTable – Google's implementation of a scalable database
- BigTable is not an RDBMS, but has a fairly flexible API that supports creative data fetching methods
- BigTable is distributed and self-balancing – scaling is no longer the application developer's problem
- You will need to host your application on Google’s App Engine
- Your application needs to use the BigTable API for data storage
Solution 2: MySQL NDB Cluster
- MySQL NDB Cluster is a master-master, self-partitioned, replicated storage engine that technically seems to provide all the features listed above
- It also offers access via SQL or a high-performance native NDB API
- It seems like the holy grail of database scaling
- However, it stores indexes entirely in memory and offers less flexibility with respect to persistence
- I could not find much material on the performance of an NDB Cluster
Solution 3: HyperTable
- HyperTable is an open source BigTable clone and provides essentially the same features as described in the BigTable whitepaper
- It is supported by Baidu, Zvents and Rediff
Solution 4: Project Voldemort
- Voldemort is a big, distributed, persistent, fault-tolerant hash
- Developed and used by LinkedIn
- Java based API
- Reasonably performant (10-20k operations per second on commodity hardware)
- Maintains replicated copies of data over multiple nodes and automatically handles server failures
- Does not support range based querying
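To see why a pure hash store cannot answer range queries while a tree type index can, consider the following small Python sketch. It is only an illustration: the `SortedIndex` class is hypothetical, and a sorted list with binary search stands in for the B+tree a real store would use. It does not reflect how Voldemort or any other product here is implemented.

```python
import bisect


class SortedIndex:
    """Toy ordered index: keys are kept sorted, so a range is contiguous."""

    def __init__(self):
        self._keys = []    # sorted keys (stand-in for a B+tree)
        self._values = {}  # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        # Point lookups work the same as in a plain hash.
        return self._values.get(key)

    def range(self, low, high):
        """Return (key, value) pairs with low <= key < high, in key order."""
        lo = bisect.bisect_left(self._keys, low)
        hi = bisect.bisect_left(self._keys, high)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


index = SortedIndex()
for name in ["alice", "bob", "carol", "dave", "erin"]:
    index.put(name, len(name))
print(index.range("b", "d"))  # keys starting with "b" or "c"
```

A hash spreads adjacent keys across unrelated buckets, so the only way to serve the same query from a hash is to scan every key. An ordered index finds both ends of the range with a search and reads only the entries in between.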
Solution 5: Tokyo Tyrant
- Tokyo Tyrant is a layer on top of Tokyo Cabinet – a highly performant, persistent data store (site claims over 2 million qps)
- Tokyo Tyrant itself claims to deliver upwards of 58,000 qps
- It supports multiple language bindings (Java, Perl, PHP etc)
- Supports various data structures – hash, tree, B+tree, array, table etc
- Supports caching
- Tokyo Tyrant does not support active-active master-master replication, and so falls short on redundancy. It also does not support data partitioning out of the box
Solution 6: Postgres + PL/Proxy + Replication (Slony / Continuent) + PGBouncer (connection pooler)
- Postgres is an extremely mature RDBMS
- Using PL/Proxy one can move horizontal partitioning concerns out of the database layer and into an abstracted underlying layer
- Using Slony or Continuent one can ensure that multiple copies of any set of rows exist at any given point in time (Synchronous or Async replication)
- PGBouncer provides a light-weight connection pooler for PL/Proxy
- Together this will satisfy all our requirements above
- You may combine Voldemort, memcached or redis with this to provide a caching layer
- Postgres also has hooks for connecting to memcached that make cache population and invalidation easier
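The caching layer mentioned above is usually wired in with a cache-aside pattern: read from the cache first, fall back to the database on a miss, and invalidate on writes. The Python sketch below shows the idea under loud assumptions: `CacheAsideStore` is a hypothetical name, and plain dictionaries stand in for both the partitioned Postgres layer and memcached, so no real client API is shown.

```python
class CacheAsideStore:
    """Hypothetical cache-aside wrapper; both backends are plain dicts."""

    def __init__(self, database, cache):
        self.database = database  # stand-in for the partitioned Postgres layer
        self.cache = cache        # stand-in for memcached or redis
        self.db_reads = 0         # counts how often the database is hit

    def get(self, key):
        # Serve from the cache when possible.
        if key in self.cache:
            return self.cache[key]
        # Cache miss: read from the database and populate the cache.
        self.db_reads += 1
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def put(self, key, value):
        # Write to the database, then invalidate the stale cache entry.
        self.database[key] = value
        self.cache.pop(key, None)


store = CacheAsideStore(database={"user:1": "Bhavin"}, cache={})
store.get("user:1")    # miss, reads the database
store.get("user:1")    # hit, served from the cache
print(store.db_reads)  # 1
```

Invalidating on write, rather than updating the cache in place, keeps the sketch simple; the next read repopulates the entry from the database.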
Research
A lot of reading went into discovering the above solutions. The links below are a good start:
BigTable
- The BigTable paper – http://labs.google.com/papers/bigtable.html
- A powerpoint that touches upon bigtable, gfs etc - http://cbcg.net/talks/googleinternals/
- Condor – a specialized workload management and job queue system – http://www.cs.wisc.edu/condor/description.html
- Notes on Jeff Dean’s talk at Univ of Washington on BigTable – http://andrewhitchcock.org/?post=214
- Video on bigtable – http://video.google.com/videoplay?docid=7278544055668715642&q=bigtable
- A blog post walkthrough of the bigtable paper – http://hnr.dnsalias.net/wordpress/2008/10/bigtable-googles-distributed-data-store/
- A description of Paxos – http://en.wikipedia.org/wiki/Paxos_algorithm
- Project Boxwood – Microsoft Research project for a scalable data layer – http://research.microsoft.com/en-us/projects/boxwood/default.aspx
Google App Engine
- http://code.google.com/appengine/
- Campfire video introducing app engine – http://www.youtube.com/watch?v=3Ztr-HhWX1c
- Getting started guide – http://code.google.com/appengine/docs/python/gettingstarted/
- Using the datastore – http://code.google.com/appengine/docs/python/gettingstarted/usingdatastore.html
MySQL Cluster
Amazon SimpleDB
Cassandra
Hypertable
MongoDB
neo4j
- http://neo4j.org/
- http://wiki.neo4j.org/content/Main_Page
- http://wiki.neo4j.org/content/Neo_Performance_Guide
- http://wiki.neo4j.org/content/FAQ
- http://dist.neo4j.org/neo-technology-introduction.pdf
- http://wiki.neo4j.org/content/Getting_Started_Guide
- http://wiki.neo4j.org/content/Getting_Started_In_One_Minute_Guide
Project Voldemort
Redis
Tokyo Cabinet
Lightcloud
Gigaspaces
Coherence
Postgres
- http://www.slony.info/
- http://www.slony.info/documentation/
- https://developer.skype.com/SkypeGarage/DbProjects/PlProxy
- https://developer.skype.com/SkypeGarage/DbProjects/PgBouncer
- http://www.continuent.com/solutions/overview
- http://www.continuent.com/images/stories/pdfs/tungsten%20overview%20white%20paper.pdf
Tags: bigtable, cassandra, google, hypertable, memcached, mysql, ndb, postgres, rdbms, scalability, simpledb, voldemort

Very nice article! Something interesting I found along these lines is that Google is planning to launch a new file system. More info here http://www.dailytech.com/Google+Works+New+File+System+into+Caffeine+Plans/article16004.htm
Another scalable data storage worthy of mention is CouchDB.
http://couchdb.apache.org/
Hi,
I have some questions.
How do you compare BigTable with HyperTable? If you could choose an alternative to big table, would you choose HyperTable or Voldemort? and why?
confused student,
Best.
HBase would be another worthy entry on this list. It is based on Hadoop.
http://hadoop.apache.org/hbase/