The Problem
A global payments processor running 400+ microservices hit a critical scaling wall. Their monitoring stack — cobbled together from three generations of tooling — was generating more noise than signal. Telemetry costs were outpacing revenue growth while latency spikes went undetected until customer complaints arrived.
The core failures: no unified collection layer, per-service cardinality explosions, and a tracing pipeline that could only sample everything or nothing.
Architecture Decision
The fundamental trade-off was cost vs. fidelity: full trace capture across 400+ services was economically unviable. The solution was a tiered telemetry model:
- Ingestion: OpenTelemetry Collector fleet with tail-based sampling (decision logic sketched below)
- Processing: Stream processing to classify traffic health in real time
- Storage: Hot/cold tiering, 72h at full fidelity then 30d of aggregates only (rollup sketched after the trade-offs list)
- Viz: Grafana with pre-computed SLO dashboards per service
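
In production this role belongs to the OpenTelemetry Collector's tail-sampling processor; the sketch below is only a rough Python illustration of the decision logic. The names (`Span`, `TailSampler`, `decision_wait_s`, `healthy_rate`) are hypothetical, and the decision window is set to the 5-second figure from the trade-offs list that follows, which is an assumption, not a confirmed detail.

```python
import random
import time
from collections import defaultdict


class Span:
    """Hypothetical span record; field names are illustrative, not the OTel SDK's."""
    def __init__(self, trace_id: str, is_error: bool):
        self.trace_id = trace_id
        self.is_error = is_error


class TailSampler:
    """Buffer spans per trace, then decide keep/drop once the decision window
    elapses: error traces are always kept, healthy traces are sampled
    probabilistically."""

    def __init__(self, decision_wait_s: float = 5.0, healthy_rate: float = 0.05):
        self.decision_wait_s = decision_wait_s  # assumed source of the ~5 s alerting lag
        self.healthy_rate = healthy_rate        # fraction of healthy traces retained
        self.buffer: dict[str, list[Span]] = defaultdict(list)
        self.first_seen: dict[str, float] = {}

    def observe(self, span: Span) -> None:
        """Buffer an incoming span under its trace until a decision is due."""
        self.buffer[span.trace_id].append(span)
        self.first_seen.setdefault(span.trace_id, time.monotonic())

    def flush_ready(self) -> list[list[Span]]:
        """Return the sampled-in traces whose decision window has elapsed."""
        now = time.monotonic()
        due = [t for t, ts in self.first_seen.items()
               if now - ts >= self.decision_wait_s]
        kept = []
        for trace_id in due:
            spans = self.buffer.pop(trace_id)
            del self.first_seen[trace_id]
            # 100% capture on error paths; probabilistic sampling otherwise.
            if any(s.is_error for s in spans) or random.random() < self.healthy_rate:
                kept.append(spans)
        return kept
```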
Key trade-offs accepted:
- A 5-second lag on alerting in exchange for a 40% cost reduction
- Sampled tracing for healthy traffic (100% capture on error paths)
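
To make the storage tier concrete, here is a minimal sketch of what the hot-to-cold rollup might look like, assuming hourly per-service aggregation of latency samples. `Aggregate`, `roll_up`, and the field names are hypothetical, not the team's actual schema.

```python
import statistics
from dataclasses import dataclass

HOT_RETENTION_HOURS = 72  # full-fidelity window from the storage tier above


@dataclass
class Aggregate:
    """What survives into the 30-day cold tier: counts and percentiles only."""
    count: int
    error_count: int
    p50_ms: float
    p95_ms: float
    p99_ms: float


def roll_up(latencies_ms: list[float], error_count: int) -> Aggregate:
    """Collapse one service-hour of raw latency samples (needs >= 2 points)
    into a cold-tier aggregate; raw points older than HOT_RETENTION_HOURS
    are deleted once this has run."""
    q = statistics.quantiles(latencies_ms, n=100)  # q[i] is the (i+1)-th percentile
    return Aggregate(
        count=len(latencies_ms),
        error_count=error_count,
        p50_ms=q[49],
        p95_ms=q[94],
        p99_ms=q[98],
    )
```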
The Outcome
Deploying the unified OTel Collector layer with intelligent sampling reduced mean time to detect (MTTD) from ~4 hours to under 60 seconds. Annual observability OpEx dropped by $2.4M through hot/cold storage tiering and cardinality controls.
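
The cardinality controls credited here usually come down to bounding metric label sets before export. A minimal sketch, assuming a label allowlist plus status-code bucketing; `ALLOWED_LABELS`, `enforce_cardinality`, and the label names are hypothetical.

```python
# Label keys allowed through to the metrics backend; unbounded dimensions
# such as user IDs, request IDs, or raw URLs are dropped before export.
ALLOWED_LABELS = {"service", "endpoint", "method", "status_code", "region"}


def enforce_cardinality(labels: dict[str, str]) -> dict[str, str]:
    """Strip unbounded labels so the per-metric series count stays predictable."""
    kept = {k: v for k, v in labels.items() if k in ALLOWED_LABELS}
    # Bucket raw HTTP status codes into classes (2xx/4xx/5xx) to bound values too.
    if kept.get("status_code", "").isdigit():
        kept["status_code"] = kept["status_code"][0] + "xx"
    return kept


# Example: a per-request label set shrinks to five bounded dimensions.
print(enforce_cardinality({
    "service": "payments-api", "endpoint": "/charge", "method": "POST",
    "status_code": "502", "region": "eu-west-1",
    "user_id": "u-8842", "request_id": "3f9c-aa01",
}))
```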
The architecture is now the internal standard for all new service onboarding across the organisation.