Comment by Ravi Menon on "everyone's obsessed with open telemetry but nobody talks about the operational nightmare"

Ravi Menon

Cloud architect. AWS and serverless.

That memory bleed is real. We hit it too. The contrib image ships with everything enabled by default, which is... not great for ops.

The trick we found: separate collectors by signal type. One lightweight instance just for metrics (Prometheus exporter, maybe 80 MB of memory), another for traces with aggressive sampling at ingestion (before buffering). That way neither instance is paying for processors it doesn't use.

On the sampling trade-off: 5% is too aggressive if you're catching production bugs. We do probabilistic sampling based on error status (100% on 5xx, 0.5% on 2xx). Costs maybe 15-20% more in ingestion but catches the actual failures.
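For anyone who wants the rule spelled out, here is a rough Python sketch of that decision. It is not our collector config, just the logic: the function names are made up for illustration, it assumes the response status is already known at the point where the decision is made, and it treats everything that isn't a 5xx like a 2xx (my simplification).

```python
import hashlib

# Rates from the comment above; everything else here is illustrative.
ERROR_RATE = 1.0      # keep 100% of traces that ended in a 5xx
SUCCESS_RATE = 0.005  # keep 0.5% of the rest (assumption: non-5xx == "2xx")


def keep_trace(trace_id: str, http_status: int) -> bool:
    """Return True if the trace should be kept, based on its final status."""
    rate = ERROR_RATE if 500 <= http_status <= 599 else SUCCESS_RATE
    # Hash the trace ID to a 64-bit integer so the decision is deterministic:
    # every span carrying the same trace ID gets the same answer.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    # Integer comparison avoids float rounding at the 100% boundary.
    return bucket < int(rate * 2**64)


def effective_rate(error_fraction: float) -> float:
    """Overall share of traces kept, given the share of traffic that is 5xx."""
    return error_fraction * ERROR_RATE + (1.0 - error_fraction) * SUCCESS_RATE
```

As a sanity check on volume: with a made-up 1% share of 5xx traffic, `effective_rate(0.01)` comes out to about 1.5% of traces kept overall. Plug in your own error share to see what the rule would cost you.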

What exporter are you pushing to? Some backends are way more expensive per span than others.

Marcus Chen

Full-stack engineer. Building with React and Go.

Feb 26

Yeah, that's the move. We split ours the same way, even went further and disabled proto parsing in the metrics collector entirely. Saved another 40 MB and killed the CPU spikes on high-cardinality labels.

Yeah, that's the right move. We do the same thing in our RAG pipelines - one collector handles embedding metrics, with a totally separate instance for trace spans. The 80 MB baseline is way more ops-friendly. Aggressive sampling on traces saves a ton without losing signal.