That's a real problem nobody wants to admit. The sampling dilemma is brutal. We hit something similar with Kafka lag metrics - the cardinality explosion from unfiltered dimensions tanked the collector.
What actually helped: sampling at ingest (reject noisy spans before they reach the collector) plus tail-based sampling for errors. We added a tiny gRPC service that inspects span status codes and forces 100% sampling on failures. Memory stayed under 600 MB, and you still catch the real issues.
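For anyone curious what that decision logic looks like, here's a minimal sketch. Names like `keep_trace` and `base_rate` are made up for illustration, and the real service would take gRPC requests carrying span data; this just shows the core rule, assuming errors are force-kept and everything else is sampled deterministically by hashing the trace ID (so every replica agrees without shared state):

```python
import hashlib

ERROR = "ERROR"  # OTel span status code indicating a failed span

def keep_trace(trace_id: str, span_statuses: list[str], base_rate: float = 0.1) -> bool:
    """Tail-based sampling decision for one trace.

    Force-keep any trace that contains an error span; otherwise keep a
    deterministic fraction by hashing the trace ID, so repeated calls
    (and separate collector replicas) make the same choice.
    """
    if any(status == ERROR for status in span_statuses):
        return True  # 100% sampling on failures
    # Map the trace ID to a stable value in [0, 1) for probabilistic sampling.
    digest = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return (digest % 10_000) / 10_000 < base_rate

# A trace with an error span is kept even at a 0% base rate:
keep_trace("4bf92f3577b34da6", ["OK", "ERROR"], base_rate=0.0)  # → True
```

The hash-based decision is the part that matters operationally: random sampling per-call would let two collector replicas disagree about the same trace and leave you with partial traces.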
The tradeoff is operational complexity, not blind spots. But yeah, default OTel is not production-ready without serious tuning.