Architecture: How would you go about building an activity feed like Facebook?

Thierry Schellenbach

First question I'm answering on Hashnode. I'm the author of Stream-Framework and run getstream.io for a living. I dream about feeds :)

First of all I would recommend studying some of the great articles about feeds. I've attached my favorites at the bottom of this answer.

Naming conventions

Start by reading the Activity Streams spec at activitystrea.ms/specs/atom/1.0 to get the naming right. An updated W3C spec that makes some small changes is also in the works.

Functionality

A lot of the design decisions depend on what functionality you need:

  • Notification/Realtime (Listen to changes in realtime)
  • Aggregation (Ben and 3 people like your picture)
  • Ranked Feeds (Sorting on more than just recency)
  • Personalized feeds (Unique sorting logic based on the user. Often done using machine learning and analytics)
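
To make the aggregation item concrete, here's a minimal sketch of how "Ben and 3 people like your picture" can be built from raw activities. The actor/verb/object field names follow the activity stream spec mentioned above; the `aggregate` function and the grouping rule (same verb and object) are assumptions for illustration, not how any particular product does it.

```python
from collections import defaultdict


def aggregate(activities):
    """Group activities by (verb, object) and summarize the actors of each group."""
    groups = defaultdict(list)
    for activity in activities:
        key = (activity["verb"], activity["object"])
        # Count each actor only once per group.
        if activity["actor"] not in groups[key]:
            groups[key].append(activity["actor"])

    summaries = []
    for (verb, obj), actors in groups.items():
        others = len(actors) - 1
        if others == 0:
            who = actors[0]
        elif others == 1:
            who = f"{actors[0]} and 1 other person"
        else:
            who = f"{actors[0]} and {others} people"
        # Verb conjugation is ignored to keep the sketch short.
        summaries.append(f"{who} {verb} {obj}")
    return summaries


likes = [
    {"actor": "Ben", "verb": "like", "object": "your picture"},
    {"actor": "Ann", "verb": "like", "object": "your picture"},
    {"actor": "Tom", "verb": "like", "object": "your picture"},
    {"actor": "Eve", "verb": "like", "object": "your picture"},
]
print(aggregate(likes))
```

In a real system the grouping key usually also includes a time window (for example, the day), so that old and new likes don't collapse into a single entry.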

Fanout on read/Fanout on write

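Most feed systems use either fanout on read or fanout on write. With fanout on write (push), each new activity is copied to the feed of every follower when it is created. With fanout on read (pull), the feed is assembled from the activities of the followed users at request time. Fanout on write is the more common choice: Instagram, Twitter and Stream all started out that way. Fanout on read is easier to build, but getting a good 99th percentile load time is really tricky.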

This paper is a great introduction to the tradeoffs between those two approaches: Yahoo Research Paper

For most apps you will need to use a combination of push and pull at some point. That's what we do at getstream.io.
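
Here's a small in-memory sketch that puts the two approaches side by side. It's a toy under stated assumptions: the `FeedStore` class and its method names are made up for this example, everything lives in Python dicts, and there is no broker or database involved. The point is only to show where the work happens: at write time for push, at read time for pull.

```python
import heapq
from collections import defaultdict


class FeedStore:
    """Toy store that supports both fanout on write (push) and fanout on read (pull)."""

    def __init__(self):
        self.followers = defaultdict(set)         # author -> users following them
        self.following = defaultdict(set)         # reader -> authors they follow
        self.user_activities = defaultdict(list)  # author -> [(timestamp, activity)]
        self.timelines = defaultdict(list)        # reader -> precomputed [(timestamp, activity)]

    def follow(self, follower, author):
        self.followers[author].add(follower)
        self.following[follower].add(author)

    def post(self, author, timestamp, activity):
        self.user_activities[author].append((timestamp, activity))
        # Fanout on write: copy the activity into every follower's timeline now.
        for follower in self.followers[author]:
            self.timelines[follower].append((timestamp, activity))

    def read_push(self, reader, limit=10):
        # Cheap read: the timeline was already built at write time.
        newest = sorted(self.timelines[reader], reverse=True)[:limit]
        return [activity for _, activity in newest]

    def read_pull(self, reader, limit=10):
        # Fanout on read: gather and merge the followed authors' activities now.
        merged = []
        for author in self.following[reader]:
            merged.extend(self.user_activities[author])
        return [activity for _, activity in heapq.nlargest(limit, merged)]


store = FeedStore()
store.follow("ann", "ben")
store.follow("ann", "eve")
store.post("ben", 1, "ben posted a photo")
store.post("eve", 2, "eve liked a picture")
store.post("ben", 3, "ben commented")
print(store.read_push("ann"))
print(store.read_pull("ann"))
```

Both reads return the same feed here. The difference is cost: `post` does one write per follower in the push model, while `read_pull` does one lookup per followed author on every request. Note that in this toy the push timeline only receives activities posted after the follow.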

Storage and message brokers

First of all, you will need to pick a message broker for your fanout on write. My recommendation is RabbitMQ for mid-size projects. If you have more time available, Kafka is a great option: it scales much better than RabbitMQ, but unfortunately it is still really hard to use and maintain.

For storing the activity feeds I would recommend Cassandra. Many people start out with Redis, but it is very easy to run into limitations, especially if you want to do aggregated feeds or otherwise need to store a lot of data in memory. Redis can get expensive very quickly. Cassandra is what Stream and Instagram use.

If you're building support for fanout on read, I recommend Postgres, Redis or ElasticSearch. ElasticSearch can be tuned to do fanout on read very efficiently (see this post: Ranked feeds with ES). Redis is a good option for fanout on read if you use ZUNIONSTORE. Postgres eventually breaks down with a fanout on read approach, but you can tweak it to last for quite some time (see a really old HighScalability article about our approach at Fashiolista).

Realtime

Faye is a great open source project. In terms of hosted solutions, PubNub and Pusher are awesome options.

Caching/Locking

If you need to cache some data, simply use Redis. Redis also has a great locking implementation if you need to lock before writing to certain feeds. Try to avoid locking at all costs though.

Removing activities/Content checks

Eventually your users will post spam and inappropriate content. This is tricky to deal with as removing activities from all the feeds can take some time. Here's the common solution for this issue:

  • Set a flag on the activity (e.g. inappropriate=true, privacy=me)
  • Filter these activities while reading the feed
  • In the background run the delete command
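
The three steps above can be sketched as follows. This is an in-memory illustration only: the `ModeratedFeeds` class, its method names and the dict-based storage are assumptions made for the example. In production the purge step would run as a background task against your real feed storage.

```python
class ModeratedFeeds:
    """Toy feed storage with flag-then-filter-then-delete moderation."""

    def __init__(self):
        self.activities = {}  # activity_id -> activity dict
        self.feeds = {}       # feed_id -> list of activity ids, newest first

    def add(self, feed_id, activity_id, activity):
        self.activities[activity_id] = dict(activity)
        self.feeds.setdefault(feed_id, []).insert(0, activity_id)

    def flag(self, activity_id):
        # Step 1: a single cheap write, no matter how many feeds hold the activity.
        self.activities[activity_id]["inappropriate"] = True

    def read(self, feed_id, limit=10):
        # Step 2: hide flagged activities at read time.
        visible = [
            self.activities[i]
            for i in self.feeds.get(feed_id, [])
            if i in self.activities and not self.activities[i].get("inappropriate")
        ]
        return visible[:limit]

    def purge_flagged(self):
        # Step 3: the slow cleanup, meant to run in the background.
        flagged = {i for i, a in self.activities.items() if a.get("inappropriate")}
        for feed_id, ids in self.feeds.items():
            self.feeds[feed_id] = [i for i in ids if i not in flagged]
        for i in flagged:
            del self.activities[i]
        return len(flagged)


feeds = ModeratedFeeds()
feeds.add("timeline:ann", "a1", {"actor": "ben", "verb": "post", "object": "photo"})
feeds.add("timeline:ann", "a2", {"actor": "spammer", "verb": "post", "object": "ad"})
feeds.flag("a2")
print(feeds.read("timeline:ann"))  # the flagged activity is hidden right away
print(feeds.purge_flagged())       # the actual delete happens later
```

The spam disappears for readers as soon as the flag is set, even though the delete across all feeds has not run yet.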

Priority queues

You will want to have a higher priority for follows and direct inserts. Otherwise your users will be waiting for their feed to show :)

Your task queueing system will have to support some level of priorities. This is easy to do with tools like Celery.
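
To show the idea without pulling in a broker, here's a minimal sketch using only the Python standard library. It is a stand-in, not Celery's API: the `PriorityTaskQueue` class and the `HIGH`/`LOW` levels are assumptions for illustration. Follows and direct inserts go in at high priority, regular fanout tasks at low priority.

```python
import heapq
import itertools

# Hypothetical priority levels: a lower number is served first.
HIGH, LOW = 0, 1


class PriorityTaskQueue:
    """Toy task queue that serves high priority tasks before low priority ones."""

    def __init__(self):
        self._heap = []
        # The counter keeps tasks with the same priority in insertion order.
        self._counter = itertools.count()

    def put(self, priority, task):
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def get(self):
        _, _, task = heapq.heappop(self._heap)
        return task

    def __len__(self):
        return len(self._heap)


queue = PriorityTaskQueue()
queue.put(LOW, "fanout: post 1 to followers")
queue.put(LOW, "fanout: post 2 to followers")
queue.put(HIGH, "follow: ann -> ben")

while queue:
    print(queue.get())
```

The follow is processed first even though it was queued last, so a user who just followed someone doesn't wait behind a long backlog of fanout tasks.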

Design Resources

getstream.io/activity-feed-design

getstream.io/based-feed-ui-kit-sketch

getstream.io/blog/13-tips-for-a-highly-enga..

Articles

Twitter 2013 Redis based

Etsy feed scaling

LinkedIn ranked feeds

Facebook history

Activity stream specification

FriendFeed approach

Yahoo Research Paper

Twitter’s approach

Cassandra at Instagram

Relevancy at Etsy

Zite architecture overview

Ranked feeds with ES

Riak at Xing - by Dr. Stefan Kaes & Sebastian Röbke

Riak and Scala at Yammer

My projects

Stream-Framework

getstream.io