Without any details about the elements of your infrastructure, our answers can only be generic. But whether you're running a big data pipeline, a commercial website, an online game, or a business app, the health of any system can be inferred from four golden signals.
And because the idea is old and well documented, here's an extract from the SRE book written by Google engineers and published by O'Reilly, «Site Reliability Engineering: How Google Runs Production Systems»:
Latency
The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.
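To make that concrete, here's a minimal sketch (not from the book) of tracking error latency alongside success latency rather than filtering errors out. It assumes a Python service instrumented with the prometheus_client library; the metric name, label values, and the process() stub are all illustrative:

```python
import time
from types import SimpleNamespace
from prometheus_client import Histogram

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency, labeled by outcome so 500s don't skew the success series",
    ["outcome"],  # "success" or "error"
)

def process(request):
    # Placeholder application handler; stands in for real request handling.
    return SimpleNamespace(status=200)

def handle(request):
    start = time.monotonic()
    outcome = "error"  # assume the worst; upgraded on a successful response
    try:
        response = process(request)
        outcome = "error" if response.status >= 500 else "success"
        return response
    finally:
        # Recording both series makes the "slow error" case visible instead
        # of hiding it behind filtered-out failures.
        REQUEST_LATENCY.labels(outcome=outcome).observe(time.monotonic() - start)
```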
Traffic
A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second.
Errors
The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, “If you committed to one-second response times, any request over one second is an error”). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you’re serving the wrong content.
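Here's a minimal sketch (not from the book) of those three failure classes applied to a single response. The Response shape, the expected-body comparison, and the one-second SLO are assumptions made for illustration:

```python
from dataclasses import dataclass

@dataclass
class Response:
    status: int        # HTTP status code
    body: bytes        # payload actually served
    latency_s: float   # time taken to serve it

SLO_LATENCY_S = 1.0    # policy: "any request over one second is an error"

def classify_failure(resp: Response, expected_body: bytes) -> str | None:
    if resp.status >= 500:
        return "explicit"   # e.g. HTTP 500s, catchable at the load balancer
    if resp.body != expected_body:
        return "implicit"   # HTTP 200 with the wrong content: only an
                            # end-to-end check can detect this
    if resp.latency_s > SLO_LATENCY_S:
        return "policy"     # technically succeeded, but too slow to count
    return None             # not an error

print(classify_failure(Response(200, b"ok", 1.3), b"ok"))  # -> policy
```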
Saturation
How “full” your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential. In complex systems, saturation can be supplemented with higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? For very simple services that have no parameters that alter the complexity of the request (e.g., “Give me a nonce” or “I need a globally unique monotonic integer”) that rarely change configuration, a static value from a load test might be adequate. As discussed in the previous paragraph, however, most services need to use indirect signals like CPU utilization or network bandwidth that have a known upper bound. Latency increases are often a leading indicator of saturation. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation. Finally, saturation is also concerned with predictions of impending saturation, such as “It looks like your database will fill its hard drive in 4 hours.”
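That disk-fill prediction at the end is, at its simplest, linear extrapolation. A minimal sketch, assuming usage grows roughly linearly over the sampled window (the sample data is invented for illustration):

```python
def hours_until_full(samples: list[tuple[float, float]], capacity: float) -> float | None:
    """samples: (timestamp in hours, bytes used) pairs, oldest first."""
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    rate = (u1 - u0) / (t1 - t0)   # bytes per hour across the window
    if rate <= 0:
        return None                 # usage flat or shrinking: no predicted fill
    return (capacity - u1) / rate

# e.g. disk grew from 700 GB to 760 GB over the last 6 hours, 800 GB capacity:
print(hours_until_full([(0, 700e9), (6, 760e9)], 800e9))  # -> 4.0 hours
```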
If you measure all four golden signals and page a human when one signal is problematic (or, in the case of saturation, nearly problematic), your service will be at least decently covered by monitoring.
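As a sketch of what "page a human when one signal is problematic" can boil down to: compare current signal values against thresholds. The threshold values and signal names below are placeholders; in practice you'd encode these as alerting rules in your monitoring system rather than inline checks:

```python
THRESHOLDS = {
    "latency_p99_s": 1.0,    # page if p99 latency exceeds 1 s
    "error_rate": 0.01,      # page if more than 1% of requests fail
    "traffic_rps": 10_000,   # page if traffic exceeds planned capacity
    "saturation": 0.8,       # page *before* 100%: degradation starts earlier
}

def breached_signals(current: dict[str, float]) -> list[str]:
    return [name for name, limit in THRESHOLDS.items()
            if current.get(name, 0.0) > limit]

signals = {"latency_p99_s": 1.4, "error_rate": 0.002,
           "traffic_rps": 3_200, "saturation": 0.45}
if breached_signals(signals):
    print(f"page the on-call: {breached_signals(signals)}")  # ['latency_p99_s']
```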
Which metrics give you the best view of your system's latency depends on the nature of your system. For a complex architecture you'll need several: each subsystem exposes different metrics for these signals, but in the end they always come down to the same four. I/O saturation, memory swapping, and dropped requests are different metrics from different systems, yet they all express some form of saturation, as the sketch below illustrates.
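For illustration only, here's how the same saturation signal can surface as different metrics in different subsystems. The subsystem and metric names are examples, not a fixed catalogue:

```python
# One signal (saturation), many concrete metrics, depending on what each
# subsystem is constrained by.
SATURATION_BY_SUBSYSTEM = {
    "database":      "disk_io_utilization",  # I/O-constrained
    "app_server":    "memory_swap_rate",     # memory-constrained
    "load_balancer": "dropped_requests",     # connection/queue-constrained
}
```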