Architecture : How would you go about building an activity feed like Facebook?

Responses (9)

Join the discussion

This answer has received 18 appreciations.

Here are a few questions you might answer to guide yourself through building a Facebook-like feed:

Twitter used to call this the "Bieber problem": when he tweeted, a fan-out operation occurred, placing this new tweet into the feeds of every one of his followers which was a massive job that locked up whole machines. In Twitter's earlier years, a couple quick @justinbieber tweets in a row could "fail whale" the whole service.

The "Bieber problem" is an illustration of a critical decision of the architecture of a feed service: do you build a user's feed at write time or at read time? Twitter's architecture placed tweets into feeds as soon as they were written into the service. By contrast, Facebook's architecture waits for a user to read their feed before actually looking up the latest posts from each followed account.

In this example, Bieber is a "source" of stories. What are your sources and how popular will they be? Will you have broad topics where one story may be distributed to many people? If you choose a write-time architecture and have one million users following a popular source, how long will it take for the one-millionth user to finally receive the post into their feed? Especially as you start to design a sharding strategy, this is a very important question.

Chronological feed or Ranked feed?

Should the story at the very top of your activity feed be the very most latest thing that happened, or should it be the very most interesting thing that happened? This is more of a product decision than an architectural one, but it has serious implications on your feed architecture.

Twitter and Instagram are good examples of chronological feeds (although Twitter has recently introduced partially ranked feeds). The "write time" post distribution makes it difficult to do anything other than a chronological feed.

Facebook is of course a Ranked feed. The way it works is when you load your Newsfeed, all of the stories posted since the last time you looked are given an "interestingness" value (how long ago it was posted is a factor) and that becomes the sort for those stories. You can think of this kind of feed like layers of sedimentary rock. As soon as a new collection of stories are loaded, they're frozen in place so the feed looks familiar, each new viewing brings more stories atop the last set. A few rare viewings will appear largely ranked and not chronological, many frequent viewings will appear more ranked and less chronological.

Can stories change position in the feed?

When a tweet is retweeted, you may see it in your feed at the time it was re-tweeted. If multiple people retweet it, it may appear where the first follower retweeted it but not appear a second time. If a story on Facebook gets a particularly relevant comment, it may resurface. Consider if your stories may change position in feed for any reason and consider how your feed architecture could enable this. Also think about what would happen to a client (mobile app, website) that was already showing the same story lower in the feed. You probably want to avoid showing duplicates!

Can stories be aggregated together?

If you have stories from subscribed topics, will you ever have an aggregation story like "5 new articles were posted to 'Politics'."? If multiple people "share" or "retweet" a story, do you show all of those actions together? If stories are replies to one-another (twitter @mentions, a good example) do you show them in the feed in conversational order, or leave them in upside-down chronological order? This may also change the position of an existing story in feed if a later event causes it to be "aggregated" like this.

Can stories be later revoked from a feed?

Can the author of the post remove it? Perhaps the author of the post blocks the viewer or vice-versa? Or the viewer un-follows the source after it was already put into a feed? Consider how these kinds of actions would impact a cached feed.


Hopefully those questions are useful in helping get more concrete about the kinds of problems you might have to solve in building a feed.

I definitely suggest being able to access a cached version, or at least partially-cached list of stories to improve speed. One of the benefits of a "write-time" feed is that you can build a very fast access cached feed. One of Instagram's best early features was its high performance. Despite being loaded on crappy iPhone 3GS on slow 3G networks, the Instagram feed always loaded lightning fast. There were zero SQL queries involved with loading a feed.

Facebook's ranked feed requires more work to cache, but because each previously loaded set of stories is saved, it can be very fast to load a second time. Part of the secret of producing a ranked feed quickly is parallelization, and prioritizing getting the first couple stories ready first so a user can start reading while the server is preparing the rest. Facebook has built custom infrastructure for feed ranking at the speed it does for as many users as it has, but the rough idea is to create something like a "max-heap" data structure so the stories most likely to be interesting are quickly available.

Also remember YAGNI. These are all great things to think about to figure out what problems you'll need to solve first, but if you're building something brand new, you're better off going for whatever will be simple and fast and tackling these problems as they become relevant for your product.

This is the most excellent response, and it's no wonder it's got so many votes. I wanted to offer another open source library for running a chronological feed computed at write time: https://github.com/kigster/simple-feed/

Write a reply...

This answer has received 9 appreciations.

First question I'm answering on Hashnode. I'm the author of Stream-Framework and run getstream.io for a living. I dream about feeds :)

First of all I would recommend studying some of the great articles about feeds. I've attached my favorites at the bottom of this answer.

Naming conventions

Start by reading this spec: activitystrea.ms/specs/atom/1.0 There is also an updated W3 spec coming up that makes some small changes.

Functionality

A lot of the design decisions depend on what functionality you need:

  • Notification/Realtime (Listen to changes in realtime)
  • Aggregation (Ben and 3 people like your picture)
  • Ranked Feeds (Sorting on more than just recency)
  • Personalized feeds (Unique sorting logic based on the user. Often done using machine learning and analytics)

Fanout on read/ Fanout on write

Most feed systems either use fanout on read or fanout on write. Fanout on write is the more common choice. Instagram, Twitter, Stream all started out that way. Fanout on read is easier to build, but getting a good 99th percentile load time is really tricky.

This paper is a great introduction into the tradeoffs between those 2 approaches: Yahoo Research Paper

For most apps you will need to use a combination of push and pull at some point. That's what we do at getstream.io

Storage and message brokers

First of all you will need to pick a message broker for your fanout on write. My recommendation is RabbitMQ for mid size projects. If you have more time available Kafka is a great option. It scales much better than RabbitMQ. Unfortunately it is still really hard to use and maintain.

For storing the activity feeds I would recommend Cassandra. Many people start out with Redis, but it is very easy to run into limitations. Especially if you want to do aggregated feeds or otherwise need to store a lot of things in memory. Redis can get expensive very quickly. Cassandra is what Stream and Instagram use.

If you're building support for fanout on read I recommend either: Postgres, Redis or ElasticSearch. ElasticSearch can be tuned to do fanout on read very efficiently. (see this post Ranked feeds with ES) Redis is a good option for fanout on read if you use ZUnionStore. Postgres eventually breaks for a fanout on read approach. But you can tweak it to last for quite some time. (a really old HighScalability article about our approach at Fashiolista.)

Realtime

Faye is a great open source project. In terms of hosted solutions PubNub and Pusher are awesome options.

Caching/Locking

If you need to cache some data simply use Redis. Redis also has a great locking implementation if you need to lock before writing to certain feeds. Try to avoid locking at all cost though.

Removing activities/Content checks

Eventually your users will post spam and inappropriate content. This is tricky to deal with as removing activities from all the feeds can take some time. Here's the common solution for this issue:

  • Set a flag on the activity (ie. inappropriate=true, privacy=me)
  • Filter these activities while reading the feed
  • In the background run the delete command

Priority queues

You will want to have a higher priority for follows and direct inserts. Otherwise your users will be waiting for their feed to show :)

You task queueing system will have to support some level of priorities. This is easy to do with tools like Celery.

Design Resources

getstream.io/activity-feed-design

getstream.io/based-feed-ui-kit-sketch

getstream.io/blog/13-tips-for-a-highly-enga..

Articles

Twitter 2013 Redis based Etsy feed scaling LinkedIn ranked feeds Facebook history

Activity stream specification

FriendFeed approach

Yahoo Research Paper

Twitter’s approach Cassandra at Instagram

Relevancy at Etsy

Zite architecture overview

Ranked feeds with ES

Riak at Xing - by Dr. Stefan Kaes & Sebastian Röbke

Riak and Scala at Yammer

My projects

Stream-Framework

getstream.io

Write a reply...

If I were to build this, I'd probably consider SQL over a JSON store, but it can be done in either. The problem isn't the storage mechanism or the language IMO, the problem is the structure.

First - understand a post is a post and user / page / company / group is a content creator - all 4 can be treated the same, the cavat being presenting a different layout for each type of content creator.

That said, I'll call a content creator a user for the sake of keeping the code short and clean.

user (content creator)
-- (unique) id
-- name
-- email
-- etc...

user_following
-- (unique) id
-- user_id

user_content
-- (unique) id
-- user_id
-- post_title
-- post_body
-- post_created_date
-- etc...

user_feed
-- (unique) id
-- creator_id
-- content_id

Now - content creator creates a post - this goes into user_content

Visitor wants to follow a content creator and get their posts in their feed - the follow button on the content creators "page" inserts an entry into the user_following table with the user_id of the visitor

When the content creator creates new content, it loops through it's user_following table and inserts a reference to the user_content entry into the visitors user_feed table - so roughly, for every follower a content creator has, upon a new post being created, it'll also create a reference for each of those followers in their user_feed table

When the visitor visits the site, it queries their user_feed table, pulls the references to the content_id and the creator_id and queries for the actual content. In SQL, this could be done with a join; in JSON you'll need to store all the data (and update, delete it) in the user_feed table

Additional tables will be needed for comments, likes, analytics, etc... but that should be the jist of it.

Past that, how the system actually functions comes into play. If a content creator had 100 posts and I suddenly unfollow them - do their past posts remain in my feed if I scroll back far enough and the change is only going forward or will past posts be removed from my feed? A simple query in SQL but in JSON, you'll need to store a lot more references to remove the old posts from my feed.

As for caching - a combination of SQL and JSON could be used - this is a whole different conversation IMO - SQL should be able to scale this nicely - with JSON, you may end up with a very big table if you have enough users / content. Caching of the actual code / html / pages shouldn't be an issue and any language should be able to accommodate. You could keep a "recent" activity feed in something like Firebase allowing the user to see real time, new posts, in their feed and past say, 20 or 40 posts, query the DB for the remainder. Again, it's a matter of how the system fill function after the initial design functionality will be built.

Write a reply...

This is quite an interesting question, really. Let's see how I'd go about it.

Alright, so the problem boils down to this: how to notify a user of any changes live from the people he's following. Okay.

So, let's break this problem down: a user can have a unique ID; and whenever s/he posts something, a small JSON object is pushed to a queue:

{
    "id": <GUID>
    "user_id": <USER_ID>
    "item_id": <ITEM_ID>
}

Here, item_id can be the item (image, post, etc.) itself. Now, whenever a user posts something, this gets queued. And when the queue gets this object, it sends out a push notification/publish-subscribe message to the followers the person has. On receiving the message, the message is displayed in the user's feed.

Now, I would like to say that publishing the item_id would be better, and then later, fetching the entire data would make this system faster.

There are other methods to do this as well. I would love to design one if you want me to! :)

I hope this helps!

Write a reply...

The other answers here are great, I just would point out that Facebook are likely using a Graph Database to link you to your activities. It is much faster for this kind of application. All actual data for each of those nodes would be stored in something like MySQL or PostgreSQL.

Write a reply...

Doesn't matter which database or frontend framework. Start with data structure. For scalability purpose I would probably have many cross tables to link data, or the possibility to link to different tables from inside other tables, so changing structures is not so hard later on. Than choose a framework/orm/dbal that can easily create links and associations. E.g. just a quick mindsnippet:

user (id, role, state, personal data)
group(id, name, data)
x_user_group(user_id, group_id)
post (id, type, date, author, content)
x_post_group(post_id, group_id)
media (id, type, author, ...)
x_post_media (post_id, media_id)    
related (table, row1_id, row2_id) {every object can be related to others of that type}
follow(id, user_id, table, row_id) {user can follow users, posts, new types later on...}
comment(id, table, row_id, author, content)

Then you can create an activity feed based on who and what user follows and relations and comments for those objects

It's not a gazillion records it's a gazillion users requesting different subsets of data across join operations.

Write a reply...

If probably start with Node and RethinkDB... though plan on moving to separate data stores once you expand beyond what a few dozen servers can handle (around 10 million regular users).

A this point, you'll likely be using some que systems, a graph database implementation and likely Cassandra. Maybe Go to replace some burdensome node services.

When you need the next step after that (hundreds of millions of regular users), you'll have some custom solutions for your needs.

Stay modular and plan to replace everything you've built over time... that's the pragmatic way to go here.

Write a reply...

I have built a few of these, here's what has worked for me:

Tools:

  1. Hbase
  2. Elastic Search
  3. Hadoop Map Reduce

Methods -Chronological feed For timeline based events I have found that using Hbase is a great solution. Hbase ( https://hbase.apache.org/ ) is a column store built for random access to billions of rows X millions of columns. It has a single primary index in the form of the key you assign to the row, but has some really useful prefix filters and scanner operations. Its also fast enough to be a cache for hot data so most of your active records will be served through a LRU cache. One really useful feature for social network models is that it supports atomic counters which are very very fast (think 'like' counts). 2 examples:

  1. Get all content from a user in chronological order:

     **Key Structure** = [UUID][JAVA_INT_MAX-TIME][CONTENTTYPE][CONTENT_UUID]
     **Scanner** = PrefixFilter([USERUUID][FIRST-N-Digits-from-timestamp])
    

    Using this scanner would get your records back from a table containing hundreds of millions of entries in under 100ms. Ideally you would have the cells of the table reference content IDs from a separate content table that could essentially be a huge hashed lookup.

  2. Get precomputed stats for a user (Followers / Posts / Likes ) in real time: In a separate table just use the UUID as key and a schema like:

         [ROW] :: [ColFamily:Followers]:[Col:[UUID's of followers]],
                  [ColFamily:Following[UUIS's of following]],
                 [ColFamily:Stats]:[Col:following] = AtomicCounter 
                 [ColFamily:Stats]:[Col:posts] = AtomicCounter
    
    • Analysis: Using Hbase also gives you Hadoop Map / Reduce abilities against the datasets, although check with your ops people on the best way to configure a cluster to serve online traffic while running map / reduce, sometimes you have to split your cluster into to 2 to support this, but thats a different post. Basically any time there is some deep question, or you forgot to add a counter to an event, you can use map / reduce to process all of the events and recalculate things. Another really useful use case for this is when you need to build a ranked feed. This is where I typically use elastic search.

      -Ranked Feeds: With social network models, there is often a lot more interesting data in the leading 24 hrs of all feeds. Using map / reduce to analyze all of your data it is possible to create detailed elastic search indexes of the most interesting content across all feeds so long as you optimize for recency or some large reduction in the size of the data to just what you need in searchable space. If done correctly this can also work for individual users ranked feeds.

Using this approach doesn't totally solve the super node problem, but it will get you far enough for most smaller networks.

Write a reply...

Choose the tools/libs/framework you already know and that you're already comfortable with. Then I suggest you read the documentation on the Activity Streams website : http://activitystrea.ms/

Write a reply...

loading ...