been running otel collectors in production for 8 months now. here's what nobody mentions: the memory bloat is absurd. our default otel-collector-contrib config was consuming 2.4gb per pod when load ramped up. we weren't even sampling, just basic traces and metrics.
switched to a minimal setup (just metric export, trace sampling at 5%) and yeah, memory dropped to 340mb. but that means you're blind to 95% of what's happening. we've missed actual bugs because they happened in the unsampled requests.
also the latency tax is real. added 80-120ms p95 overhead to api calls just from instrumentation. nobody benchmarks this honestly. everyone ships the otel quickstart and calls it done.
# what people do
with tracer.start_as_current_span("expensive_operation"):
    ...

# what you actually need to do
if should_sample():
    with tracer.start_as_current_span("expensive_operation"):
        ...
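a minimal sketch of what `should_sample()` could be, using the 5% rate from above. the name and the hand-rolled approach are illustrative; in a real setup you'd configure the SDK's built-in ratio sampler instead of gating spans by hand.

```python
import random

# illustrative 5% head-sampling rate, matching the setup described above
SAMPLE_RATE = 0.05

def should_sample(rate: float = SAMPLE_RATE) -> bool:
    # keep roughly `rate` fraction of requests; everything else
    # skips span creation entirely, so you pay no instrumentation cost
    return random.random() < rate
```
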
observability tooling is solving a real problem but the ecosystem treats it like it's free. it's not. you're trading cpu and latency for visibility and most teams don't budget for that.
That memory bloat is real. We hit it too. The contrib image ships with everything enabled by default, which is... not great for ops.
The trick we found: separate collectors by signal type. One lightweight instance just for metrics (Prometheus exporter, maybe 80mb), another for traces with aggressive sampling at ingestion (before buffering). That way you're not paying for unused processors.
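Rough sketch of the split, assuming the standard collector config schema. The receiver/processor/exporter names are the stock ones; endpoints are placeholders, and the exact processor list depends on your pipeline.

```yaml
# metrics-collector.yaml — lightweight, metrics only
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"   # placeholder
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
---
# traces-collector.yaml — sample at ingestion, before buffering
receivers:
  otlp:
    protocols:
      grpc:
processors:
  probabilistic_sampler:
    sampling_percentage: 5
exporters:
  otlp:
    endpoint: "backend:4317"   # placeholder
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler]
      exporters: [otlp]
```

The point is that the metrics instance never loads trace processors at all, so its footprint stays flat regardless of trace volume.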
On the sampling trade-off: 5% is too aggressive if you're catching production bugs. We do probabilistic sampling based on error status (100% on 5xx, 0.5% on 2xx). Costs maybe 15-20% more in ingestion but catches the actual failures.
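A sketch of that status-based split, assuming the HTTP status is known at the point you make the sampling call. The function name and rates are just the numbers from this comment (100% on 5xx, 0.5% on 2xx), not a real otel API.

```python
import random

def keep_span(http_status: int, ok_rate: float = 0.005) -> bool:
    # always keep server errors, so failures never land in the unsampled pile
    if http_status >= 500:
        return True
    # everything else gets cheap probabilistic head sampling at 0.5%
    return random.random() < ok_rate
```
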
What exporter are you pushing to? Some backends are way more expensive per span than others.
Priya Sharma
Backend dev obsessed with distributed systems
That's a real problem nobody wants to admit. The sampling dilemma is brutal. We hit something similar with Kafka lag metrics - the cardinality explosion from unfiltered dimensions tanked the collector.
What actually helped: sampling at ingest (reject noisy spans before they hit the collector) plus tail-based sampling for errors. Added a tiny gRPC service that looks at span status codes and forces 100% sampling on failures. Memory stayed under 600mb, and you catch real issues.
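The decision that little service makes can be sketched like this. The span representation (dicts with a `status_code` key, where OTLP status code 2 means ERROR) is an assumption for illustration; the real service would work on buffered protobuf spans.

```python
import random
from typing import Iterable

def keep_trace(spans: Iterable[dict], baseline_rate: float = 0.05) -> bool:
    # tail decision: once a trace's spans are buffered, force 100%
    # sampling if any span errored (OTLP StatusCode ERROR == 2)
    if any(span.get("status_code") == 2 for span in spans):
        return True
    # healthy traces fall back to probabilistic sampling
    return random.random() < baseline_rate
```
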
The tradeoff is operational complexity, not blind spots. But yeah, default otel is not production-ready without serious tuning.