I believe that this is a perfect case for a new type of database, and perhaps even a completely new framework of programming. I don't have a good name for it, and the ideas are still vague in my head right now, but I'll try to outline what I'm thinking of below.
I would break Twitter up into a series of tables which get distributed and replicated among a cluster of servers. The tables would relate to each other, but not in the strict atomic transaction model, but one of eventual consistency. These tables would be:
- Users
- Queues
- Subscriptions
- Content
The bandwidth external to Twitter is pretty high, because you've got lots of people with many subscribers. The amount of actual non-duplicate data is surprisingly small... and I'm guessing that it's on the order of 3kbytes/second. The real challenge is distributing this 3kbytes in a consistent and reliable manner to all the places it gets copied out.
A flow based database would be able to handle such types of loads by maintaining many local copies and keeping them in eventual consistency by tying them into a channel. This is a place where multicast might be a really good strategy, if not a straight peer-to-peer network.
A flow database could be a straight up normal table in an RDBMS, or it could be something new optimized to the task.
What do you folks think?
No comments:
Post a Comment