That's a real problem nobody wants to admit. The sampling dilemma is brutal. We hit something similar with Kafka lag metrics - the cardinality explosion from unfiltered dimensions tanked the collector.
What actually helped: sampling at ingest (reject noisy spans before they reach the collector) plus tail-based sampling for errors. We added a tiny gRPC service that inspects span status codes and forces 100% sampling on failures. Memory stayed under 600 MB, and you still catch the real issues.
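For anyone curious what that decision logic looks like, here's a minimal sketch. Names like `keep_trace` and `base_rate` are made up for illustration, and the real service would take gRPC requests carrying span data; this just shows the core rule, assuming errors are force-kept and everything else is sampled deterministically by hashing the trace ID (so every replica agrees without shared state):

```python
import hashlib

ERROR = "ERROR"  # OTel span status code indicating a failed span

def keep_trace(trace_id: str, span_statuses: list[str], base_rate: float = 0.1) -> bool:
    """Tail-based sampling decision for one trace.

    Force-keep any trace that contains an error span; otherwise keep a
    deterministic fraction by hashing the trace ID, so repeated calls
    (and separate collector replicas) make the same choice.
    """
    if any(status == ERROR for status in span_statuses):
        return True  # 100% sampling on failures
    # Map the trace ID to a stable value in [0, 1) for probabilistic sampling.
    digest = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return (digest % 10_000) / 10_000 < base_rate

# A trace with an error span is kept even at a 0% base rate:
keep_trace("4bf92f3577b34da6", ["OK", "ERROR"], base_rate=0.0)  # → True
```

The hash-based decision is the part that matters operationally: random sampling per-call would let two collector replicas disagree about the same trace and leave you with partial traces.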
The tradeoff is operational complexity, not blind spots. But yeah, default OTel is not production-ready without serious tuning.