I was wondering: if I were to build Google Analytics today, what stack should I use, and what does Google use? The scale of the app may not be as large as Google's, but it should work for most general use cases.
The system will have the following requirements:
I have some ideas like using:
Am I missing any piece here? What would you choose to build this?
There are other products like keen.io and treasuredata. What tech stack would they be using currently?
Don't think large scale before you have a product. Chances are you are going to make a lot of bad assumptions early on and you will need to refactor anyway. It's much easier to refactor a smaller project with limited technologies than it is to refactor a large distributed system.
j 's answer pretty much nails it.
My modification would be to keep the data in MongoDB and send the data that needs to be queryable to Elasticsearch. The API servers write data to Mongo.
Kafka can be your broker system, where your API servers produce the messages to be processed and the ES servers are the consumers.
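To make the flow concrete, here is a minimal sketch of that write path, not a real deployment. Everything in it is an assumption for illustration: plain dicts stand in for MongoDB and Elasticsearch, a `queue.Queue` stands in for a Kafka topic, and the names `api_write`, `index_consumer`, `search`, and `QUERYABLE_FIELDS` are hypothetical.

```python
import json
import queue

# Stand-ins for real services (assumptions for illustration only).
primary_store = {}      # plays the role of MongoDB: full event, source of truth
search_index = {}       # plays the role of Elasticsearch: queryable copy
broker = queue.Queue()  # plays the role of a Kafka topic

# Hypothetical list of fields worth indexing for queries.
QUERYABLE_FIELDS = ("event", "page", "country")


def api_write(event_id, payload):
    """API server: persist the full event, then publish it for indexing."""
    primary_store[event_id] = payload
    broker.put(json.dumps({"id": event_id, "payload": payload}))


def index_consumer():
    """Consumer: drain the broker and index only the queryable fields."""
    indexed = 0
    while True:
        try:
            message = json.loads(broker.get_nowait())
        except queue.Empty:
            return indexed
        doc = {k: v for k, v in message["payload"].items()
               if k in QUERYABLE_FIELDS}
        search_index[message["id"]] = doc
        indexed += 1


def search(field, value):
    """Tiny stand-in for a search query: exact match on one field."""
    return sorted(event_id for event_id, doc in search_index.items()
                  if doc.get(field) == value)
```

The point of the split is that the API server only has to finish the primary write and the publish. Indexing happens later, at whatever pace the consumer manages, so the search copy may briefly lag behind the primary store.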
You could also look at a combination of Kafka and Samza.
I think having the raw data will help someone with this question. But if you want to pull data directly from Google Analytics, you can use its different APIs: owox.com/blog/articles/google-analytics-api-compa… I think that article is worth reading.
If you go for Google scale, HBase, Apache Spark, and Redshift are more along the lines of what you're going to need. Even with services like docs.treasuredata.com/articles/quickstart, it's still just for long-term storage.
You need high write throughput, and for an analytics system I would go for a document-based storage system, since every request can be different and you probably need to scale across geographic locations. So no pgsql. It's possible, yes, but that's not its purpose; you can use it for the administration parts and things that require consistency.
Cassandra is a column-based system. It's nice and probably a good fit for this use case, but I would still put a processing pipeline in front of it. Something like
But that's just a basic idea. You need to think about fallbacks based on CAP / PACELC, and you need a queue for buffering, so that realtime updates can be delivered to the end user as well as sent on to the API.
We're talking about an event-based / stream-based design that has to be async, with eventual consistency.
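As a rough sketch of what that buffering looks like, assuming an in-process `asyncio.Queue` in place of a real broker and a plain list in place of the database, the following shows a realtime counter being updated immediately while storage writes are batched behind it. The names `ingest`, `process`, `run_pipeline`, and the `batch_size` and `buffer_size` parameters are hypothetical.

```python
import asyncio
from collections import Counter


async def ingest(buffer, events):
    """Producer: accept events and enqueue them without waiting on storage."""
    for event in events:
        await buffer.put(event)  # waits only when the buffer is full
    await buffer.put(None)       # sentinel: no more events


async def process(buffer, live_counts, batch_store, batch_size=2):
    """Consumer: update realtime counters at once, write storage in batches."""
    batch = []
    while True:
        event = await buffer.get()
        if event is None:
            break
        live_counts[event["page"]] += 1      # realtime view for the end user
        batch.append(event)
        if len(batch) >= batch_size:
            batch_store.append(list(batch))  # stand-in for a bulk write
            batch.clear()
    if batch:
        batch_store.append(list(batch))      # flush whatever is left


async def run_pipeline(events, buffer_size=100):
    """Wire producer and consumer together through a bounded buffer."""
    buffer = asyncio.Queue(maxsize=buffer_size)
    live_counts, batch_store = Counter(), []
    await asyncio.gather(
        ingest(buffer, events),
        process(buffer, live_counts, batch_store),
    )
    return live_counts, batch_store
```

Calling `asyncio.run(run_pipeline(events))` with a list of dicts that each have a `"page"` key returns the live counts and the stored batches. Between flushes the stored data lags the live counters, which is the eventual consistency trade-off in miniature.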
Those are my initial thoughts.