1d ago · 8 min read · Have you ever experienced a product that passed functional tests but failed after thermal cycling, or developed intermittent failures after months in the field? Opening the enclosure reveals cracked s
2d ago · 10 min read · If you've never heard the term before, here's the short version: toil is the operational work that keeps your systems running today but does nothing to make them easier to run tomorrow. It's the 2 AM
2d ago · 10 min read · Reliability engineering used to be the exclusive domain of Site Reliability Engineers and infrastructure teams. But as backend developers take on more ownership of the services they build, from deploy
May 24 · 10 min read · The first fix lasted 90 seconds. We had corrected the Grafana datasource URL from prometheus:9999 back to prometheus:9090, watched the pod roll, refreshed the dashboard, and seen one panel come alive.
May 18 · 6 min read · We made it easier to use. Then it broke. I got pulled into an incident recently where one of our highest-value enterprise accounts couldn't export their survey data. Their analytics pipeline had gone
May 14 · 5 min read · Broker APIs are powerful. They are also the kind of powerful where one careless script can make your day very interesting. So I built trade-ops-cli, a terminal-based broker operations tool designed ar
May 11 · 5 min read · How to Build a Self-Healing Python Script That Never Fails Meta description: Learn to build robust Python scripts with self-healing capabilities, ensuring continuous execution and minimizing downtime. Tags: Python scripting, self-healing scripts, err...