Comment by Eugene Chernysh on "Designing for Failure: Strategies to Build Resilient, Always-On Services"

Hi! Thanks for the insights. Just a quick question: in your experience, what has been the most challenging part of implementing chaos engineering in a production environment? How do you ensure it doesn’t cause disruptions for end users?

There wasn’t one "most challenging" part, but rather a series of hurdles we had to address. One challenge was ensuring chaos tests didn’t accidentally impact production or end-user experience, like simulating network latency without affecting live traffic. Another was managing non-critical dependencies that could still cause cascading issues. We solved these by setting up tight monitoring, isolating failure points, and ensuring recovery mechanisms like circuit breakers and rollbacks were ready to act, which allowed us to minimize disruptions.
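For readers unfamiliar with the circuit breakers mentioned above, here is a minimal sketch of the pattern in Python. It is not the implementation used by the team in this thread; the class name `CircuitBreaker`, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Illustrative circuit breaker (hypothetical, not from the original post)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # timeout elapsed: allow one trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # Fail fast instead of sending more load to a struggling dependency.
            raise RuntimeError("circuit open: call skipped")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a dependency call as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` is a hypothetical function) means that during a chaos experiment a failing dependency is cut off after a few errors rather than dragging down its callers.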

Abhishek Vajarekar Sounds like a solid strategy. How do you prioritize which failure scenarios to test, and do you ever have to balance testing depth with the risk of introducing too many disruptions?

Eugene Chernysh We prioritize failure scenarios based on what’s most likely to go wrong and how it could affect the system. For example, we first test things like network lag or database issues, since those are pretty common. When it comes to balancing how deep we go with testing, we make sure to test in isolated environments first, with good monitoring to avoid messing up production. We focus on depth, but always make sure there are recovery steps ready if anything goes wrong.
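One simple way to express "most likely to go wrong" combined with "how it could affect the system" is a likelihood-times-impact score. The sketch below is only an illustration of that idea, not the process described in this thread; the `Scenario` type, the 1 to 5 scales, and the example scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def risk(self) -> int:
        # Higher score means the scenario should be tested earlier.
        return self.likelihood * self.impact


def prioritize(scenarios):
    """Return scenarios ordered from highest to lowest risk score."""
    return sorted(scenarios, key=lambda s: s.risk, reverse=True)


# Illustrative values only; real scores would come from incident history.
scenarios = [
    Scenario("non-critical dependency timeout", likelihood=2, impact=3),
    Scenario("database issue", likelihood=4, impact=4),
    Scenario("network latency", likelihood=5, impact=4),
]

for s in prioritize(scenarios):
    print(f"{s.name}: {s.risk}")
```

With these made-up scores, network latency and database issues land at the top of the list, which matches the ordering described in the reply above.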
