Hey awesome developers! :) Happy new year 2018.
What service(s) do you use for tracking errors, checking infrastructure health and alerting team members when production is on?
Many great options have already been mentioned, but I'll add a few items, since I'm surprised they haven't come up yet.
If you don't want to, or can't afford to, pay the human cost of managing your own Prometheus, _Datadog_ is a nice managed solution. It isn't free, but it can help you focus on your core features until you can put more energy into saving costs and moving to Prometheus. Since the ops teams at my current and previous companies have been very small, this has proven to be a key item in monitoring and debugging our stacks. (And since they bought Logmatic last year, their log solution, which complements the metrics and APM tools, should be released at some point this year, which would make it even easier to get started and focus on developing the product.)
Finally, if you want a fairly long list of quality tools, the Cloud Native Computing Foundation regularly updates its landscape, in which the Observability and Analysis section should help you make sure you consider the best options available. The GitHub repo, if you want to track updates: github.com/cncf/landscape
Side note: why is this tagged General Programming without the devops or architecture tags? (This is not a rant; I often find myself missing questions because I get lost in tags...)
Custom-built tools for everything integrated into Slack.
Bugsnag and Sentry for a few servers, both integrated with Slack. Keymetrics for monitoring Node.js servers and MongoDB. StatusCake for uptime monitoring (reported to Slack).
Tools we are using currently:
We use a custom monitoring tool developed in-house. There is a client daemon installed on every server to collect server metrics; it's difficult to monitor when you have 50+ servers :P
We have certain filters in our code that check the application load per server and notify us when the load exceeds a threshold.
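A load filter like the one described above could be sketched roughly as follows. This is only an illustration, not the in-house tool: the threshold value, the per-core normalisation, and the names `is_overloaded` and `check_local_load` are all assumptions.

```python
import os

# Hypothetical threshold: 1-minute load average per CPU core above which we alert.
LOAD_THRESHOLD_PER_CORE = 1.5


def is_overloaded(load_1min, cpu_count, threshold=LOAD_THRESHOLD_PER_CORE):
    """Return True when the load per core exceeds the threshold."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return (load_1min / cpu_count) > threshold


def check_local_load(notify=print):
    """Check this machine's load and call notify() when it is too high."""
    load_1min, _, _ = os.getloadavg()  # available on Unix-like systems only
    cores = os.cpu_count() or 1
    overloaded = is_overloaded(load_1min, cores)
    if overloaded:
        notify(f"High load: {load_1min:.2f} on {cores} cores")
    return overloaded
```

In a real setup `notify` would be whatever alerting channel you already have (email, chat, pager) rather than `print`.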
So, basically, we don't use anything fancy or open source to monitor our infrastructure :D
We use Uptime Robot for external checks and Zabbix for internal ones.
Uptime Robot is great at periodically checking our public-facing site. Zabbix, on the other hand, can collect a lot of metrics and send alerts when something goes wrong.
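To give an idea of what an external check does, here is a minimal sketch of a single HTTP probe using only the Python standard library. It is not how Uptime Robot or Zabbix work internally; the function name `check_site` and the `opener` parameter (there so the probe can be exercised without a network) are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def check_site(url, timeout=10.0, opener=urllib.request.urlopen):
    """Probe a URL once and return (is_up, detail)."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, and so on.
        return False, f"unreachable: {exc}"
    return 200 <= status < 400, f"HTTP {status}"
```

Running this from a scheduler (cron, for instance) on a machine outside your own infrastructure gives a very basic version of the external check.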
We use
I have also tested Sentry for error reporting and liked it very much; I use it for most of my side projects.
Has free options.
To monitor events from my server, such as errors, I use bugsnag.com, which integrates with Slack, plus some simple custom scripts that send me a message in Slack if something happens, or just general stats. Machine stats (RAM, network, etc.) are available from the DigitalOcean.com monitoring UI, and I receive an email every time the server is low on memory, for example.
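A custom script of that kind can be quite small. The sketch below assumes a Slack incoming webhook, which accepts a JSON body with a `text` field; the function names and the message layout are hypothetical, and the webhook URL is something you would create in your own Slack workspace.

```python
import json
import urllib.request


def build_slack_payload(title, details):
    """Build the JSON body for a Slack incoming webhook message."""
    message = {"text": f"*{title}*\n{details}"}
    return json.dumps(message).encode("utf-8")


def send_to_slack(webhook_url, title, details, timeout=5.0):
    """POST the message to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=build_slack_payload(title, details),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `send_to_slack(your_webhook_url, "Low memory", "Only 120 MB free")` from a cron job or an error handler is enough to get a message into a channel.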