We have our own home-made tools: each application sends a heartbeat to a queue. Our monitoring application picks up the heartbeats and registers the application as online upon the first one; as soon as it stops receiving heartbeats, it sounds the alarm.
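The core of that logic is just a last-seen timestamp per application. A minimal sketch in Python, assuming a hypothetical timeout and class name (not our actual code):

```python
import time

class HeartbeatMonitor:
    """Tracks the last heartbeat per application and flags stale ones.

    Illustrative sketch: the 45-second timeout and names are assumptions,
    not the real system's values.
    """

    def __init__(self, timeout_seconds=45):
        self.timeout = timeout_seconds
        self.last_seen = {}  # app name -> timestamp of last heartbeat

    def on_heartbeat(self, app_name, now=None):
        now = time.time() if now is None else now
        if app_name not in self.last_seen:
            print(f"{app_name} registered as online")  # first heartbeat = online
        self.last_seen[app_name] = now

    def stale_apps(self, now=None):
        """Applications whose heartbeats have gone quiet -> sound the alarm."""
        now = time.time() if now is None else now
        return [app for app, seen in self.last_seen.items()
                if now - seen > self.timeout]
```

The consumer on the queue calls `on_heartbeat` per message, and a timer periodically checks `stale_apps`.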
Email / SMS applications each monitor that the monitoring system itself is up and running; if it isn't, they send an alert.
One application does a SELECT 1 FROM TABLE every 15 seconds. If it gets a result, it sends a heartbeat; if not, it doesn't. When the monitoring application stops receiving that heartbeat, it sounds the alarm that the database is offline or not responding.
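The probe itself is tiny. A sketch using the stdlib sqlite3 module as a stand-in for the real database driver (the function names and the use of sqlite3 are illustrative):

```python
import sqlite3

def database_alive(conn):
    """Return True only if SELECT 1 actually comes back with a row.

    Any driver error (connection dropped, timeout, ...) counts as 'not
    responding' -- in that case no heartbeat is sent, and the monitor's
    silence-detection does the alerting.
    """
    try:
        row = conn.execute("SELECT 1").fetchone()
        return row is not None
    except Exception:
        return False

def probe_once(conn, send_heartbeat):
    """Run every 15 seconds: heartbeat only on a successful probe."""
    if database_alive(conn):
        send_heartbeat("DB-PROBE")
```

Note that the probe never sends a "database is down" message itself; the absence of the heartbeat is the signal.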
We also have an application pinging our frontend applications from outside, from different regions. When a frontend application receives a ping, it sends a heartbeat labeled with the region, e.g. DE-PING. If there's a network issue between, say, Germany and our software, the frontend stops receiving pings from Germany, stops sending DE-PING heartbeats, and the monitoring application alerts us that it's no longer receiving heartbeats from Germany via that frontend.
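The region-labeled heartbeats reuse the same silence-based mechanism. A small sketch of the frontend side, with hypothetical function names:

```python
def heartbeat_label(region_code):
    """Label heartbeats by the region the ping came from, e.g. 'de' -> 'DE-PING',
    so the monitor can tell which network path went quiet."""
    return f"{region_code.upper()}-PING"

def handle_ping(region_code, send_heartbeat):
    # The frontend only emits this heartbeat when it actually received a
    # ping from that region; a Germany-to-us network issue therefore shows
    # up on the monitor as missing DE-PING heartbeats, with no extra logic.
    send_heartbeat(heartbeat_label(region_code))
```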
Based on the monitoring data, we can decide when to shift load, spin up more instances, etc.
We're planning to take it further by installing agents on the VMs that send CPU / memory information every 15 seconds to the heartbeat queue, so the monitoring system can also alert us on high CPU / memory usage.
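The agent side could look something like this. The payload shape and thresholds are assumptions; the samplers are injected so the sketch stays self-contained (in practice they might read from psutil or /proc):

```python
import json
import time

def resource_heartbeat(host, cpu_percent, mem_percent, now=None):
    """Build the message a VM agent would push to the heartbeat queue
    every 15 seconds. Field names are illustrative."""
    return json.dumps({
        "type": "RESOURCE",
        "host": host,
        "cpu_percent": cpu_percent,
        "mem_percent": mem_percent,
        "ts": time.time() if now is None else now,
    })

def breaches_threshold(payload, cpu_limit=90.0, mem_limit=90.0):
    """Monitor side: alert when either resource crosses its limit.
    The 90% limits are example values, not ours."""
    data = json.loads(payload)
    return data["cpu_percent"] > cpu_limit or data["mem_percent"] > mem_limit
```

Because the resource messages ride on the same queue as liveness heartbeats, a silent agent also doubles as a "VM is down" signal.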
Funnily enough, we're very close to having replicated what DataDog offers, but with full control of everything.
Update: we also have a canary system. Since all our logging is centralised via queues, we have an application that analyzes the logs; if we see a spike in error logs after a deploy, we can quickly swing back to the old version, analyze the logs, and fix the issue before redeploying.
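The spike detection can be as simple as comparing the post-deploy error rate to the pre-deploy baseline. A sketch, with made-up threshold values:

```python
def error_spike(baseline_errors_per_min, current_errors_per_min,
                factor=3.0, floor=5):
    """Flag a spike when the post-deploy error rate exceeds the baseline
    by `factor`, with an absolute floor so a jump from 0 to 2 errors on a
    quiet service doesn't trip a rollback. Both thresholds are illustrative.
    """
    return (current_errors_per_min >= floor and
            current_errors_per_min > factor * baseline_errors_per_min)
```

If this returns True shortly after a deploy, the new version gets rolled back and the logs get a human's attention.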