Discussion on "Designing for Failure: Strategies to Build Resilient, Always-On Services"

Abhishek Vajarekar · 2023-11-14T20:00:00.000Z

In a world where software systems are increasingly distributed and expected to operate at scale, system failures are inevitable. Systems that don't plan for these potential breakdowns risk prolonged downtime, data loss, and diminished user trust. For...

Hi! Thanks for the insights. Just a quick question: in your experience, what has been the most challenging part of implementing chaos engineering in a production environment? How do you ensure it doesn’t cause disruptions for end users?

Abhishek Vajarekar Sounds like a solid strategy. How do you prioritize which failure scenarios to test, and do you ever have to balance testing depth with the risk of introducing too many disruptions?

Eugene Chernysh We prioritize failure scenarios based on what’s most likely to go wrong and how much it would affect the system. For example, we first test things like network lag or database issues, since those are pretty common. When it comes to balancing how deep we go with testing, we test in isolated environments first, with good monitoring, to avoid messing up production. We focus on depth, but always make sure recovery steps are ready if anything goes wrong.
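To make the "most likely to go wrong" and "how it could affect the system" idea concrete, here is a minimal sketch that ranks scenarios by a simple likelihood × impact score. The 1 to 5 scales, the scenario names, and the scores are all hypothetical placeholders, not numbers from the article or from anyone's production setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureScenario:
    name: str
    likelihood: int  # hypothetical scale: 1 (rare) to 5 (frequent)
    impact: int      # hypothetical scale: 1 (minor) to 5 (severe)

    @property
    def risk(self) -> int:
        # Simple risk score; real teams may weight the two factors differently.
        return self.likelihood * self.impact


def prioritize(scenarios):
    """Return scenarios ordered highest risk first, ties broken by impact."""
    return sorted(scenarios, key=lambda s: (s.risk, s.impact), reverse=True)


# Illustrative values only.
scenarios = [
    FailureScenario("network latency", likelihood=5, impact=3),
    FailureScenario("database outage", likelihood=3, impact=5),
    FailureScenario("full region loss", likelihood=1, impact=5),
    FailureScenario("disk full on one node", likelihood=2, impact=2),
]

ranked = prioritize(scenarios)
for s in ranked:
    print(f"{s.risk:>2}  {s.name}")
```

The tie-break on impact is a design choice: when two scenarios score the same, the one that would hurt more gets tested first.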

Redundancy and failover mechanisms are crucial when building resilient systems, especially as systems scale. I’m curious about how you approach balancing the complexity of setting up these mechanisms with the need for cost efficiency. Have you found any trade-offs when it comes to choosing between geographic redundancy versus region-based failovers, especially in terms of response time and infrastructure cost?

When choosing between geographic redundancy and region-based failovers, the main trade-off is latency versus cost. Geographic redundancy offers higher availability across continents but comes with higher costs and potentially higher latency. Region-based failovers are cheaper, with lower latency, and are ideal for most systems that don’t require global disaster recovery. The decision depends on your business needs and risk tolerance. If global availability is crucial, go for geographic redundancy; otherwise, region-based failovers usually provide a good balance.
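One way to picture how the two approaches can work together is a failover picker that prefers the closest healthy replica and only crosses continents when nothing nearer is available. This is a rough sketch under my own assumptions: the `Replica` type, the region and continent labels, and the three-tier "distance" are hypothetical, and real routing would usually be handled by DNS or a load balancer rather than application code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Replica:
    name: str
    region: str
    continent: str
    healthy: bool


def pick_replica(replicas, client_region: str, client_continent: str) -> Optional[Replica]:
    """Pick the nearest healthy replica, or None if every replica is down."""

    def distance(r: Replica) -> int:
        if r.region == client_region:
            return 0  # same region: lowest latency, cheapest option
        if r.continent == client_continent:
            return 1  # region-based failover
        return 2      # geographic redundancy: higher latency, last resort

    healthy = [r for r in replicas if r.healthy]
    return min(healthy, key=distance, default=None)


# Hypothetical deployment: the local region is down.
replicas = [
    Replica("a", region="eu-west", continent="eu", healthy=False),
    Replica("b", region="eu-central", continent="eu", healthy=True),
    Replica("c", region="us-east", continent="na", healthy=True),
]

chosen = pick_replica(replicas, client_region="eu-west", client_continent="eu")
print(chosen.name if chosen else "no healthy replica")
```

Dropping the cross-continent replica from the list gives you the cheaper region-only setup; keeping it buys availability at the cost of the extra infrastructure and latency described above.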

👍👍👍👍👍
