Architecture: How would you go about building an activity feed like Facebook?
First of all I would recommend studying some of the great articles about feeds. I've attached my favorites at the bottom of this answer.
Start by reading this spec: activitystrea.ms/specs/atom/1.0. There is also an updated W3C spec coming up that makes some small changes.
A lot of the design decisions depend on what functionality you need:
- Notification/Realtime (Listen to changes in realtime)
- Aggregation (Ben and 3 people like your picture)
- Ranked Feeds (Sorting on more than just recency)
- Personalized feeds (Unique sorting logic based on the user. Often done using machine learning and analytics)
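To make the aggregation case concrete, here is a minimal sketch in Python. The `Activity` fields and the grouping key (verb plus target) are my own assumptions for illustration; real systems usually also group by time window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    actor: str
    verb: str
    target: str


def aggregate(activities):
    """Group activities by (verb, target), preserving the input order."""
    groups = {}
    for activity in activities:  # assumed to be sorted newest first
        key = (activity.verb, activity.target)
        groups.setdefault(key, []).append(activity.actor)
    return groups


def summarize(verb, target, actors):
    """Render one group, e.g. 'Ben and 3 people like your picture'."""
    first, others = actors[0], len(actors) - 1
    if others == 0:
        return f"{first} {verb}s {target}"
    noun = "person" if others == 1 else "people"
    return f"{first} and {others} {noun} {verb} {target}"
```

Calling `aggregate` on a newest-first list and then `summarize` on each group gives one line per group instead of one line per activity.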
Fanout on read / Fanout on write
Most feed systems use either fanout on read or fanout on write. Fanout on write is the more common choice: Instagram, Twitter, and Stream all started out that way. Fanout on read is easier to build, but getting a good 99th percentile load time is really tricky.
This paper is a great introduction to the tradeoffs between those two approaches: Yahoo Research Paper
For most apps you will need to use a combination of push and pull at some point. That's what we do at getstream.io.
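Here is a rough in-memory sketch of the two strategies, just to show where the work happens. The dictionaries stand in for real storage and all function names are hypothetical; this is not how any particular product implements it.

```python
from collections import defaultdict
import heapq

followers = defaultdict(set)   # user -> users who follow them
following = defaultdict(set)   # user -> users they follow
timelines = defaultdict(list)  # user -> activities they authored
feeds = defaultdict(list)      # user -> precomputed feed (fanout on write)


def follow(user, target):
    following[user].add(target)
    followers[target].add(user)


def post_fanout_on_write(author, timestamp, text):
    """Write path does the work: copy the activity into every follower's feed."""
    activity = (timestamp, author, text)
    timelines[author].append(activity)
    for follower in followers[author]:
        feeds[follower].append(activity)


def read_precomputed(user, limit=10):
    """Read path is cheap: the feed already exists, just return the newest items."""
    return sorted(feeds[user], reverse=True)[:limit]


def read_fanout_on_read(user, limit=10):
    """Read path does the work: merge the timelines of everyone the user follows."""
    sources = [sorted(timelines[u], reverse=True) for u in following[user]]
    return list(heapq.merge(*sources, reverse=True))[:limit]
```

With fanout on write, a post by someone with many followers costs many writes but every read is a single lookup. With fanout on read, the post is one write but every read touches as many timelines as the user follows, which is where the slow tail latencies come from.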
Storage and message brokers
First of all you will need to pick a message broker for your fanout on write. My recommendation is RabbitMQ for mid-size projects. If you have more time available, Kafka is a great option. It scales much better than RabbitMQ, but unfortunately it is still really hard to use and maintain.
For storing the activity feeds I would recommend Cassandra. Many people start out with Redis, but it is very easy to run into limitations, especially if you want to do aggregated feeds or otherwise need to store a lot of data in memory. Redis can get expensive very quickly. Cassandra is what Stream and Instagram use.
If you're building support for fanout on read I recommend one of Postgres, Redis or ElasticSearch. ElasticSearch can be tuned to do fanout on read very efficiently (see this post: Ranked feeds with ES). Redis is a good option for fanout on read if you use ZUNIONSTORE. Postgres eventually breaks for a fanout on read approach, but you can tweak it to last for quite some time (see this really old HighScalability article about our approach at Fashiolista).
For realtime updates, Faye is a great open source project. In terms of hosted solutions PubNub and Pusher are awesome options.
If you need to cache some data, simply use Redis. Redis also has a great locking implementation if you need to lock before writing to certain feeds. Try to avoid locking at all costs though.
Removing activities/Content checks
Eventually your users will post spam and inappropriate content. This is tricky to deal with as removing activities from all the feeds can take some time. Here's the common solution for this issue:
- Set a flag on the activity (e.g. inappropriate=true, privacy=me)
- Filter these activities while reading the feed
- In the background run the delete command
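The three steps above can be sketched roughly like this. The in-memory dictionaries and the function names `flag`, `read_feed` and `purge` are assumptions for illustration; in practice the stores would be your database and `purge` would run as a background task.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    id: int
    actor: str
    text: str
    inappropriate: bool = False


activities = {}  # activity id -> Activity (single source of truth)
feeds = {}       # user -> list of activity ids (fanned-out copies)


def flag(activity_id):
    """Step 1: one cheap write hides the activity in every feed at once."""
    activities[activity_id].inappropriate = True


def read_feed(user):
    """Step 2: skip flagged or already deleted activities at read time."""
    visible = []
    for activity_id in feeds.get(user, []):
        activity = activities.get(activity_id)
        if activity is not None and not activity.inappropriate:
            visible.append(activity)
    return visible


def purge(activity_id):
    """Step 3: slow cleanup that removes the copies from every feed."""
    for ids in feeds.values():
        if activity_id in ids:
            ids.remove(activity_id)
    activities.pop(activity_id, None)
```

The point of the flag is that users stop seeing the content immediately, even though the copies stay in the feeds until the background delete has finished.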
You will want to give follows and direct inserts a higher priority than other feed tasks. Otherwise your users will be waiting for their feed to show :)
Your task queueing system will have to support some level of priorities. This is easy to do with tools like Celery.
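As a small illustration of what priorities buy you, here is a sketch using Python's standard library `queue.PriorityQueue` as a stand-in for a real task queue. The task names and the two priority levels are made up for the example; with Celery or a broker you would configure this there rather than write it yourself.

```python
import itertools
import queue

HIGH, LOW = 0, 10            # lower number is served first (assumed convention)
_tasks = queue.PriorityQueue()
_order = itertools.count()   # tie-breaker keeps FIFO order within one priority


def enqueue(priority, name, *args):
    """Queue a task; the sequence number avoids comparing task payloads."""
    _tasks.put((priority, next(_order), name, args))


def drain():
    """Pop every queued task in priority order and return the task names."""
    done = []
    while not _tasks.empty():
        _priority, _seq, name, _args = _tasks.get()
        done.append(name)
    return done
```

If a follow and a direct insert are queued behind a large fanout job, they still get picked up first, so the user who just followed someone sees a populated feed right away.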