Architecture : How would you go about building an activity feed like Facebook?View other answers to this thread
Here are a few questions you might answer to guide yourself through building a Facebook-like feed:
How many followers will the most popular content sources have?
Twitter used to call this the "Bieber problem": when he tweeted, a fan-out operation occurred, placing this new tweet into the feeds of every one of his followers which was a massive job that locked up whole machines. In Twitter's earlier years, a couple quick @justinbieber tweets in a row could "fail whale" the whole service.
The "Bieber problem" is an illustration of a critical decision of the architecture of a feed service: do you build a user's feed at write time or at read time? Twitter's architecture placed tweets into feeds as soon as they were written into the service. By contrast, Facebook's architecture waits for a user to read their feed before actually looking up the latest posts from each followed account.
In this example, Bieber is a "source" of stories. What are your sources and how popular will they be? Will you have broad topics where one story may be distributed to many people? If you choose a write-time architecture and have one million users following a popular source, how long will it take for the one-millionth user to finally receive the post into their feed? Especially as you start to design a sharding strategy, this is a very important question.
Chronological feed or Ranked feed?
Should the story at the very top of your activity feed be the very most latest thing that happened, or should it be the very most interesting thing that happened? This is more of a product decision than an architectural one, but it has serious implications on your feed architecture.
Twitter and Instagram are good examples of chronological feeds (although Twitter has recently introduced partially ranked feeds). The "write time" post distribution makes it difficult to do anything other than a chronological feed.
Facebook is of course a Ranked feed. The way it works is when you load your Newsfeed, all of the stories posted since the last time you looked are given an "interestingness" value (how long ago it was posted is a factor) and that becomes the sort for those stories. You can think of this kind of feed like layers of sedimentary rock. As soon as a new collection of stories are loaded, they're frozen in place so the feed looks familiar, each new viewing brings more stories atop the last set. A few rare viewings will appear largely ranked and not chronological, many frequent viewings will appear more ranked and less chronological.
Can stories change position in the feed?
When a tweet is retweeted, you may see it in your feed at the time it was re-tweeted. If multiple people retweet it, it may appear where the first follower retweeted it but not appear a second time. If a story on Facebook gets a particularly relevant comment, it may resurface. Consider if your stories may change position in feed for any reason and consider how your feed architecture could enable this. Also think about what would happen to a client (mobile app, website) that was already showing the same story lower in the feed. You probably want to avoid showing duplicates!
Can stories be aggregated together?
If you have stories from subscribed topics, will you ever have an aggregation story like "5 new articles were posted to 'Politics'."? If multiple people "share" or "retweet" a story, do you show all of those actions together? If stories are replies to one-another (twitter @mentions, a good example) do you show them in the feed in conversational order, or leave them in upside-down chronological order? This may also change the position of an existing story in feed if a later event causes it to be "aggregated" like this.
Can stories be later revoked from a feed?
Can the author of the post remove it? Perhaps the author of the post blocks the viewer or vice-versa? Or the viewer un-follows the source after it was already put into a feed? Consider how these kinds of actions would impact a cached feed.
Hopefully those questions are useful in helping get more concrete about the kinds of problems you might have to solve in building a feed.
I definitely suggest being able to access a cached version, or at least partially-cached list of stories to improve speed. One of the benefits of a "write-time" feed is that you can build a very fast access cached feed. One of Instagram's best early features was its high performance. Despite being loaded on crappy iPhone 3GS on slow 3G networks, the Instagram feed always loaded lightning fast. There were zero SQL queries involved with loading a feed.
Facebook's ranked feed requires more work to cache, but because each previously loaded set of stories is saved, it can be very fast to load a second time. Part of the secret of producing a ranked feed quickly is parallelization, and prioritizing getting the first couple stories ready first so a user can start reading while the server is preparing the rest. Facebook has built custom infrastructure for feed ranking at the speed it does for as many users as it has, but the rough idea is to create something like a "max-heap" data structure so the stories most likely to be interesting are quickly available.
Also remember YAGNI. These are all great things to think about to figure out what problems you'll need to solve first, but if you're building something brand new, you're better off going for whatever will be simple and fast and tackling these problems as they become relevant for your product.