Thread

Ankit Singhaniya

Full Stack Developer

Sep 14, 2017

What would be the tech stack of a high data volume system like Google analytics?

If I were to build something like Google Analytics today, what stack should I use, and what does Google itself use? The app need not reach Google's scale, but it should work for most general use cases.

The system will have the following requirements:

The volume of data can be very high and variable
The data needs to be stored for a long period of time and should be queryable
It should be cost-effective (goes without saying)

I have some ideas:

Serverless seems like the way to go, since it scales up and down easily and cost-effectively
The database is the most confusing part, since it has to handle that much data efficiently. I have the following options:
1. MongoDB - queryable, and as a NoSQL store it should be able to handle a large volume of data
2. Cassandra - highly scalable and made for high-volume data, but what about querying and setup?
3. PostgreSQL - querying will be a charm, but can it handle the volume?
4. Elasticsearch - ??
I am also wondering whether I will need a service like Kafka or Kinesis on the front line

Am I missing any piece here? What would you choose to build this?

There are other products like keen.io and Treasure Data. What tech stack might they be using currently?

#programming #system-architecture

Responses(5)

If you go for Google scale, HBase, Apache Spark, and Redshift are more along the lines of what you're going to need. Even with services like docs.treasuredata.com/articles/quickstart, it's still just for long-term storage.

You need high write throughput, and for an analytics system I would go for a document-based storage system, since every request can be different and you will probably need to scale across geographic locations. So no PostgreSQL for this part: it's possible, yes, but that's not its purpose. You can use it for the administration parts and anything that requires consistency.

Cassandra is a column-based system. It's nice and probably a good fit for this use case, but I would still put a processing pipeline in front of it, something like

But that's just a basic idea. You need to think about fallbacks based on CAP / PACELC, and you need a queue for buffering, so that realtime updates can be delivered to the end user as well as sent to the API.

We're talking about an event-based / stream-based design that has to be async, with eventual consistency.

Those are my initial thoughts.
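To make the buffering idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration: an in-process asyncio.Queue stands in for Kafka or Kinesis, a plain list stands in for the database, and the names collect, writer and batch_size are made up for this example.

```python
import asyncio
import json


async def collect(queue: asyncio.Queue, raw_hit: str) -> None:
    """Accept one tracking hit and return right away; storage happens later."""
    await queue.put(json.loads(raw_hit))


async def writer(queue: asyncio.Queue, store: list, batch_size: int = 3) -> None:
    """Drain the buffer in batches, standing in for bulk writes to a database."""
    batch = []
    while True:
        event = await queue.get()
        if event is None:  # sentinel: stop reading, flush what is left
            break
        batch.append(event)
        if len(batch) >= batch_size:
            store.extend(batch)
            batch = []
    store.extend(batch)


async def main() -> list:
    queue = asyncio.Queue(maxsize=1000)  # bounded buffer absorbs traffic spikes
    store = []  # hypothetical stand-in for the real storage layer
    task = asyncio.create_task(writer(queue, store))
    for i in range(5):
        await collect(queue, json.dumps({"page": f"/p/{i}", "visitor": i % 2}))
    await queue.put(None)
    await task
    return store


stored = asyncio.run(main())
print(len(stored))
```

The collecting side never waits for the database, which is where the eventual consistency comes from: a hit is accepted first and only becomes queryable once the writer has flushed its batch.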

Ankit Singhaniya, since Treasure Data is a service, I think it runs several engines in the background. I know of Cassandra, but I've only had a glimpse of it.

About the consumer: you need a consumer for Kafka. The consumer basically just pulls the stream and distributes it to different points of the application, which lets you control the stream. Consumers also scale differently from the logic behind them. You don't strictly need it; it's just an architecture decision.
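A rough sketch of what such a consumer does, in Python. This is not a real Kafka client: the stream is just a list of events, and Consumer, subscribe and count_live are hypothetical names chosen for the illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

Event = Dict[str, object]
Handler = Callable[[Event], None]


class Consumer:
    """Pulls events from a stream and hands each one to every subscriber."""

    def __init__(self) -> None:
        self._handlers: List[Handler] = []

    def subscribe(self, handler: Handler) -> None:
        self._handlers.append(handler)

    def run(self, stream: Iterable[Event]) -> int:
        processed = 0
        for event in stream:
            for handler in self._handlers:
                handler(event)
            processed += 1
        return processed


archive: List[Event] = []        # stand-in for long-term storage
live_counts = defaultdict(int)   # stand-in for a realtime view


def count_live(event: Event) -> None:
    live_counts[event["page"]] += 1


consumer = Consumer()
consumer.subscribe(archive.append)
consumer.subscribe(count_live)
processed = consumer.run(
    [{"page": "/home"}, {"page": "/pricing"}, {"page": "/home"}]
)
print(processed, dict(live_counts))
```

The point of the split is that the part reading the stream and the parts acting on events can change and scale separately; adding another destination is just one more subscribe call.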

The document DB is just there to keep a queryable document store, so you can easily store the requests; since they are usually JSON, they can be stored without transformation, and it's eventually consistent... :D If you don't like the idea, take it out :) As I mentioned, it was just an off-the-top-of-my-head idea.
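As a toy illustration of "store the JSON as it arrives and query it later", here is a small in-memory sketch in Python. DocumentStore, insert and find are invented for this example and are not the API of MongoDB or any other product; a real document database would add indexing, persistence and replication on top.

```python
import json
from typing import Any, Dict, List


class DocumentStore:
    """Keeps each request as a parsed JSON document, with no fixed schema."""

    def __init__(self) -> None:
        self._docs: List[Dict[str, Any]] = []

    def insert(self, raw_json: str) -> None:
        # No transformation step: the document is stored with whatever fields it has.
        self._docs.append(json.loads(raw_json))

    def find(self, **criteria: Any) -> List[Dict[str, Any]]:
        # Return documents whose fields match every given key/value pair.
        return [
            doc for doc in self._docs
            if all(doc.get(key) == value for key, value in criteria.items())
        ]


store = DocumentStore()
store.insert('{"page": "/home", "referrer": "google"}')
store.insert('{"page": "/home", "screen": {"w": 1280, "h": 720}}')
store.insert('{"page": "/pricing", "campaign": "autumn"}')

home_hits = store.find(page="/home")
from_google = store.find(referrer="google")
print(len(home_hits), len(from_google))
```

Each request here carries different fields, which is the case where a schemaless store is convenient: nothing has to be migrated when a new field shows up.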
