Saturday 18 October 2014

DB - distributed file systems - Thrift

Just use MySQL or Postgres. Why? If your entire _active set_ fits in a single machine's main memory (which can be as high as 128GB+ on modern commodity machines), you don't have a horizontal scalability problem: i.e., there is absolutely no reason for you to partition ("shard") your database and give up relations. If your active data set fits in memory, almost any properly tuned database with an index will perform well enough to saturate your Ethernet card before the database itself becomes the limitation.

If you decide the relational model itself isn't a good fit, you can easily build a "document oriented store" on top of MySQL: this is what FriendFeed ended up doing, and I'd follow their model (except I'd use Apache Avro, Apache Thrift, or Protocol Buffers instead of language-specific serialization) -- http://bret.appspot.com/entry/ho...
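A minimal sketch of that pattern, assuming the FriendFeed approach of one "entities" table holding opaque serialized blobs plus separate index tables for the fields you query on. sqlite3 and json are stand-ins here so the example is self-contained; in practice this would be MySQL/InnoDB with Thrift, Avro, or Protocol Buffers:

```python
# Entities are stored as opaque serialized blobs keyed by a UUID; each
# queryable field gets its own narrow index table that can be rebuilt
# offline when the "schema" evolves.
import json
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body BLOB NOT NULL)")
db.execute("CREATE TABLE index_user_id (user_id TEXT, entity_id TEXT)")
db.execute("CREATE INDEX idx_user ON index_user_id (user_id)")

def put(entity):
    entity_id = entity.get("id") or uuid.uuid4().hex
    entity["id"] = entity_id
    blob = json.dumps(entity).encode()               # serialize the whole document
    db.execute("REPLACE INTO entities (id, body) VALUES (?, ?)", (entity_id, blob))
    db.execute("INSERT INTO index_user_id (user_id, entity_id) VALUES (?, ?)",
               (entity["user_id"], entity_id))       # maintain the secondary index
    db.commit()
    return entity_id

def get_by_user(user_id):
    rows = db.execute(
        "SELECT e.body FROM index_user_id i JOIN entities e ON e.id = i.entity_id "
        "WHERE i.user_id = ?", (user_id,))
    return [json.loads(body) for (body,) in rows]

put({"user_id": "alice", "text": "hello"})
print(get_by_user("alice"))
```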

If your site becomes immensely successful, you will have an active set that no longer fits into your machine's main memory. In this case, an improperly designed storage engine's performance will fall off rapidly. MySQL's InnoDB (or Postgres's storage engine), however, will still allow you to maintain (depending on your request distribution) a ~2:1-5:1 data-to-memory ratio with a spinning disk. Once you've gone beyond that, performance begins to fall off rapidly (as you're making multiple disk seeks for every request). Now, your best course of action is simply to upgrade to SSDs (solid state drives), which -- again -- allow you to saturate your Ethernet card *before* the database becomes a limitation.
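Back-of-envelope arithmetic behind that ratio (the figures are illustrative, not measured):

```python
# Rough implication of the ~2:1-5:1 data-to-memory claim above: with 128 GB
# of RAM, a well-tuned InnoDB/Postgres instance on spinning disk might serve
# very roughly 256-640 GB of data before most requests need multiple seeks.
ram_gb = 128
for ratio in (2, 5):
    print(f"{ratio}:1 ratio -> ~{ram_gb * ratio} GB of data on a {ram_gb} GB box")
```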

Finally, when you get to a data set size that doesn't fit on, e.g., several SSDs in a software RAID 1+0 configuration (while still leaving space for backups, multiple versions of data, etc.), then you have to scale horizontally. That is, you will have to use a database that intrinsically supports partitioning (e.g., Riak, Voldemort, Cassandra, HBase) or build an application-level partitioning layer on top of your MySQL/Postgres-based data store. I can't tell you which solution is correct, as neither I (nor you) have any clue what your data and its access patterns will be like at that point. That said, writing your own sharding layer is yet another place where you can introduce additional bugs into the code: not having to build your own distributed database (which is effectively what you are doing by building a sharding layer) is the major appeal of using an existing, scalable NoSQL database.
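For illustration, here's roughly what the smallest possible application-level partitioning layer looks like; the shard host names are hypothetical, and a real layer would also need to handle resharding, replication, and cross-shard queries -- which is exactly the complexity warned about above:

```python
# Route each entity key to a fixed shard by hashing the key.
import hashlib

SHARD_DSNS = [          # hypothetical connection targets, one per shard
    "db-shard-0.internal", "db-shard-1.internal",
    "db-shard-2.internal", "db-shard-3.internal",
]

def shard_for(key: str) -> str:
    # Hash rather than taking modulo of raw ids so sequential ids spread evenly.
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARD_DSNS[int(digest, 16) % len(SHARD_DSNS)]

print(shard_for("user:42"))   # every lookup for this key goes to the same shard
```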

Note how I am still not bringing the CAP theorem into play. The reason is that CAP itself has nothing to do with scalability, and everything to do with availability and the handling of failures. What it means is that under certain failure scenarios (called Partitions, not to be confused with database partitioning!), you cannot retain Availability and provide linearizable Consistency at the same time. Linearizable consistency roughly corresponds to the A and I in ACID. This has more to do with replication of a single entity (e.g., a row in a database) across multiple machines; with horizontal partitioning it's already difficult (for other reasons) to perform transactions between multiple entities in a database.

It's a common misconception that SQL databases "choose C" and "NoSQL" databases "choose A". In reality, I believe several SQL databases do *not* use the "serializable" transaction isolation level by default (choosing snapshot isolation instead), even on a single machine. When using MySQL's asynchronous replication, it's possible to end up in a scenario where a master machine receives a write, allows readers to see it, and then goes down *before* shipping that value to a replica -- i.e., losing serializable consistency when that replica is read from (upon the master's failure).
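A toy simulation of that replication window (not MySQL itself, just the sequence of events being described):

```python
# The master acknowledges and serves a write before shipping it to the
# replica; if the master dies in that window, a failover read sees stale data.
master, replica = {}, {}
replication_queue = []

def write(key, value):
    master[key] = value                      # acknowledged to the client immediately
    replication_queue.append((key, value))   # shipped to the replica *later*

def replicate_one():
    if replication_queue:
        k, v = replication_queue.pop(0)
        replica[k] = v

write("balance", 100)
print(master.get("balance"))    # 100 -- readers on the master see the write
# ... master crashes before replicate_one() ever runs; we fail over:
print(replica.get("balance"))   # None -- the acknowledged write is lost
```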

Meanwhile, many NoSQL databases (e.g., HBase) do not actually provide "CAP-A" Availability (in exchange for atomic mutation/compare-and-set operations, e.g., atomically incrementing a column within a row in HBase), or allow themselves to be configured for consistency rather than availability (e.g., Voldemort or Riak configured to require strict read and write quorums, for applications such as counters).
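The quorum trade-off mentioned for Voldemort/Riak comes down to simple arithmetic: with N replicas, choosing read and write quorum sizes R and W such that R + W > N forces every read quorum to overlap the latest write quorum, at the cost of requiring more nodes to answer. A tiny sketch:

```python
# Dynamo-style quorum check: overlap guarantees a read sees the newest write.
def quorums_overlap(n: int, r: int, w: int) -> bool:
    return r + w > n

print(quorums_overlap(3, 2, 2))   # True  -> consistent, but more nodes must respond
print(quorums_overlap(3, 1, 1))   # False -> fast and available, but may read stale data
```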

There's also a hidden variable in CAP: latency. If you can simply retry your operation until a new master node is elected or the old one comes back online (which is usually fast, as most failures are transient), you will, effectively, have both high A-availability and C-consistency, as you can simply wait for the P-partition to be over (this recovery time is called "MTTR"). That, obviously, isn't an option for large sites: users will click away if they wait too long for pages to load, and money will be lost if items can't be added to shopping carts or ads can't be displayed. However, that isn't necessarily a concern when your traffic volumes aren't significant: again, this is a business decision.
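A sketch of that retry strategy, assuming a hypothetical `do_write` operation that raises a connection error while no master is reachable:

```python
# "Just wait out the partition": retry with bounded exponential backoff.
# If the failover finishes within your latency budget, you get both the
# availability and the consistency you wanted -- paid for in response time.
import random
import time

def retry_until_healed(do_write, attempts=10, base_delay=0.1):
    for attempt in range(attempts):
        try:
            return do_write()
        except ConnectionError:
            # Backoff with jitter; total wait is bounded by the expected MTTR.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise TimeoutError("partition outlasted our latency budget")
```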

Which CAP trade-offs do you choose? That, again, depends on your application and your data. You may note that many large applications (e.g., complex websites) use a combination of the two (strong consistency for some operations, high availability for others), depending on the business requirements.

(Note: I am grossly oversimplifying and speaking in the context of a single datacenter. When you have replication across a WAN, strong consistency becomes impractical (the latency costs are prohibitive) -- that's why, e.g., HBase supports log shipping to allow asynchronous replication to a remote site).

Summary: understand your data and application, and *then* plan for providing scalability and high availability for your data and application. If you're intellectually curious about distributed systems and issues like CAP et al., see the answers in "What are the best resources for learning about distributed file systems?"


---------------------

http://thrift.apache.org/ 
Thrift:
- has been open source for longer, so it has support for more languages.
- lacks thorough documentation, though there are pretty good examples
- is slightly slower than Protocol Buffers
- includes an RPC framework

Overall, I'd generally use Thrift if the use case was to quickly make a server that can serve RPCs. If I was storing data in a log or a flat file and needed a serialization format, and was only going to use one of the languages Protocol Buffers supports, then I'd use that instead.
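For the RPC-server case, here's roughly what a minimal Thrift Python server looks like. It assumes a hypothetical `ping.thrift` IDL containing `namespace py ping` and `service PingService { string ping() }`, compiled with `thrift --gen py`; the generated module names below follow from that assumption:

```python
# Minimal Thrift RPC server sketch: a handler object implements the service
# methods, the generated Processor dispatches requests to it, and TSimpleServer
# accepts connections on a blocking socket.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from thrift.server import TServer
from ping import PingService          # generated code (assumed IDL, see above)

class PingHandler:
    def ping(self):
        return "pong"

if __name__ == "__main__":
    processor = PingService.Processor(PingHandler())
    transport = TSocket.TServerSocket(port=9090)
    server = TServer.TSimpleServer(processor, transport,
                                   TTransport.TBufferedTransportFactory(),
                                   TBinaryProtocol.TBinaryProtocolFactory())
    server.serve()
```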


----------------------

A systems-oriented reading list, in approximately chronological order:

* Design and Implementation of the Sun Network Filesystem - http://www.cs.ucsb.edu/~ravenben...
* Scale and Performance in a Distributed Filesystem - http://citeseerx.ist.psu.edu/vie...
* A Case for Redundant Arrays of Inexpensive Disks (RAID) - http://www.cs.cmu.edu/~garth/RAIDpaper/Patterson88.pdf
* Separating data and control transfer in distributed operating systems - http://portal.acm.org/citation.c...
* Zebra (and other research from Sprite) - http://www.eecs.berkeley.edu/Res...
* Disconnected Operation in the Coda Filesystem - http://www.cs.ucsb.edu/~ravenben...
* Frangipani: A Scalable Distributed Filesystem - http://pdos.csail.mit.edu/6.824/... (and Petal: http://net.pku.edu.cn/~course/cs...)
* The Google File System - http://labs.google.com/papers/gf...
* Ceph: A Scalable, High-Performance Distributed File System - http://ceph.newdream.net/papers/...

You can't just dive in and read all the literature, so you've got to choose an angle. Since (in my opinion) there's little fundamental theory you have to learn before you can make sense of research filesystems, and much of the difficulty lies in the massive and interesting engineering challenges that scale and performance present, it makes sense to read about the systems first and dive into the areas of theory you find interesting.