Friday, May 30, 2008

Flow based databases to fix twitter?

Twitter is a popular web based instant messenger service, which has been having problems with scaling lately. The facts seem to indicate that a traditional Relational DataBase Management System just isn't an appropriate fit in this case.
I believe that this is a perfect case for a new type of database, and perhaps even a completely new framework of programming. I don't have a good name for it, and the ideas are still vague in my head right now, but I'll try to outline what I'm thinking of below.

I would break Twitter up into a series of tables which get distributed and replicated among a cluster of servers. The tables would relate to each other, but not in the strict atomic transaction model, but one of eventual consistency. These tables would be:
  • Users
  • Queues
  • Subscriptions
  • Content
The real trick would be to treat the tables more like queues or pipes full of data that are appended to with very low random write frequency. The changes would then be aggregated in a channel to make it easy to keep multiple copies for coping with the heavy read access from all of the clients connected at any given time.

The bandwidth external to Twitter is pretty high, because you've got lots of people with many subscribers. The amount of actual non-duplicate data is surprisingly small... and I'm guessing that it's on the order of 3kbytes/second. The real challenge is distributing this 3kbytes in a consistent and reliable manner to all the places it gets copied out.

A flow based database would be able to handle such types of loads by maintaining many local copies and keeping them in eventual consistency by tying them into a channel. This is a place where multicast might be a really good strategy, if not a straight peer-to-peer network.

A flow database could be a straight up normal table in an RDBMS, or it could be something new optimized to the task.

What do you folks think?

No comments: