How do you handle 300+ servers in 35 datacenters (in 15 regions)?
Via automation, monitoring, and alerting :) We automate everything we can (we mainly use Chef), we monitor like crazy (currently over 250,000 metrics and more than 10k points per second, and still growing), and we have alerts on everything that could go wrong, even slow indexing or slow queries. Despite all of that, we add new metrics every week. Logistics is also a pretty important part of our job: finding and dealing with a large number of providers worldwide.
Automation, automation, automation (and some automated testing). First of all, all our services are resilient. We try to have 2n+1 clusters for every service we provide, because Algolia is not only a search engine but also an analytics platform, a monitoring system, a website, and so on.
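For readers unfamiliar with the "2n+1" sizing, here is a minimal sketch of the arithmetic behind it. It assumes the usual majority-quorum reading (a cluster of 2n+1 nodes keeps a majority after losing n of them); the answer above does not say which consensus or replication scheme is actually used, and the function names are made up for illustration.

```python
# Illustrative only: majority-quorum arithmetic for a 2n+1 cluster.
# Assumption: "2n+1" means the cluster must keep a strict majority alive.

def cluster_size(n: int) -> int:
    """Nodes needed to tolerate n simultaneous failures."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return 2 * n + 1


def majority(size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return size // 2 + 1


def has_quorum(size: int, failed: int) -> bool:
    """True if the surviving nodes still form a majority."""
    return size - failed >= majority(size)


if __name__ == "__main__":
    for n in range(1, 4):
        size = cluster_size(n)
        print(
            f"n={n}: size={size}, majority={majority(size)}, "
            f"survives {n} failures={has_quorum(size, n)}, "
            f"survives {n + 1} failures={has_quorum(size, n + 1)}"
        )
```

Under that assumption, a 3-node cluster survives one failure and a 5-node cluster survives two; adding a fourth node to a 3-node cluster would not raise the number of failures tolerated, which is why odd sizes are the common choice.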
Oh btw, we have more than 400 servers now :)